Chapter 1


Preface: Fast Fourier Transforms


This book focuses on the discrete Fourier transform (DFT), discrete convolution, and, particularly,
the fast algorithms to calculate them. These topics have been at the center of digital
signal processing since its beginning, and new results in hardware, theory and applications 
continue to keep them important and exciting. 


As far as we can tell, Gauss was the first to propose the techniques that we now call the fast 
Fourier transform (FFT) for calculating the coefficients in a trigonometric expansion of an 
asteroid’s orbit in 1805 [174]. However, it was the seminal paper by Cooley and Tukey [88] 
in 1965 that caught the attention of the science and engineering community and, in a way, 
founded the discipline of digital signal processing (DSP). 


The impact of the Cooley-Tukey FFT was enormous. Problems could be solved quickly 
that were not even considered a few years earlier. A flurry of research expanded the theory 
and developed excellent practical programs as well as opening new applications [94]. In 
1976, Winograd published a short paper [403] that set a second flurry of research in motion 
[86]. This was another type of algorithm that expanded the data lengths that could be 
transformed efficiently and reduced the number of multiplications required. The groundwork
for this algorithm had been laid earlier by Good [148] and by Rader [308]. In 1997 Frigo
and Johnson developed a program they called the FFTW (fastest Fourier transform in the
west) [130], [135], which is a composite of many of the ideas in other algorithms as well as new
results to give a robust, very fast system for general data lengths on a variety of computer
and DSP architectures. This work won the 1999 Wilkinson Prize for Numerical Software.


It is hard to overemphasize the importance of the DFT, convolution, and fast algorithms. With
a history that goes back to Gauss [174] and a compilation of references on these topics that 
in 1995 resulted in over 2400 entries [362], the FFT may be the most important numerical 
algorithm in science, engineering, and applied mathematics. New theoretical results still are 
appearing, advances in computers and hardware continually restate the basic questions, and 
new applications open new areas for research. It is hoped that this book will provide the
background, references, programs and incentive to encourage further research and results in
this area as well as provide tools for practical applications. 


Studying the FFT is not only valuable in understanding a powerful tool, it is also a prototype 
or example of how algorithms can be made efficient and how a theory can be developed to 
define optimality. The history of this development also gives insight into the process of 
research where timing and serendipity play interesting roles. 


Much of the material contained in this book has been collected over 40 years of teaching and 
research in DSP; therefore, it is difficult to attribute just where it all came from. Some comes
from my earlier FFT book [59] which was sponsored by Texas Instruments and some from 
the FFT chapter in [217]. Certainly the interaction with people like Jim Cooley and Charlie 
Rader was central but the work with graduate students and undergraduates was probably 
the most formative. I would particularly like to acknowledge Ramesh Agarwal, Howard 
Johnson, Mike Heideman, Henrik Sorensen, Doug Jones, Ivan Selesnick, Haitao Guo, and 
Gary Sitton. Interaction with my colleagues, Tom Parks, Hans Schuessler, Al Oppenheim, 
and Sanjit Mitra has been essential over many years. Support has come from the NSF, Texas 
Instruments, and the wonderful teaching and research environment at Rice University and 
in the IEEE Signal Processing Society. 


Several chapters or sections are written by authors who have extensive experience and depth
working on the particular topics. Ivan Selesnick has written several papers on the design of
short FFTs to be used in the prime factor algorithm (PFA) FFT and on automatic design
of these short FFTs. Markus Püschel has developed a theoretical framework for “Algebraic
Signal Processing” which allows a structured generation of FFT programs and a system
called “Spiral” for automatically generating algorithms specifically for an architecture. Steven
Johnson along with his colleague Matteo Frigo created, developed, and now maintains the
powerful FFTW system: the Fastest Fourier Transform in the West. I sincerely thank these
authors for their significant contributions.


I would also like to thank Prentice Hall, Inc., who returned the copyright on The DFT as
Convolution or Filtering (Chapter 5) of Advanced Topics in Signal Processing [49]
around which some of this book is built. The content of this book is in the Connexions
(http://cnx.org/content/col10550/) repository and, therefore, is available for on-line use,
PDF downloading, or purchase as a printed, bound physical book. I certainly want to
thank Daniel Williamson, Amy Kavalewitz, and the staff of Connexions for their invaluable
help. Additional FFT material can be found in Connexions, particularly content by Doug
Jones [205], Ivan Selesnick [205], and Howard Johnson [205]. Note that this book and all
the content in Connexions are copyrighted under the Creative Commons Attribution license
(http://creativecommons.org/).


If readers find errors in any of the modules of this collection or have suggestions for improvements
or additions, please email the author of the collection or module.
Chapter 2


Introduction: Fast Fourier Transforms


The development of fast algorithms usually consists of using special properties of the algorithm
of interest to remove redundant or unnecessary operations of a direct implementation.
Because of the periodicity, symmetries, and orthogonality of the basis functions and the 
special relationship with convolution, the discrete Fourier transform (DFT) has enormous 
capacity for improvement of its arithmetic efficiency. 


There are four main approaches to formulating efficient DFT [50] algorithms. The first two 
break a DFT into multiple shorter ones. This is done in Multidimensional Index Mapping 
(Chapter 3) by using an index map and in Polynomial Description of Signals (Chapter 4) by 
polynomial reduction. The third is Factoring the Signal Processing Operators (Chapter 6) 
which factors the DFT operator (matrix) into sparse factors. The DFT as Convolution or 
Filtering (Chapter 5) develops a method which converts a prime-length DFT into cyclic 
convolution. Still another approach is interesting where, for certain cases, the evaluation of 
the DFT can be posed recursively as evaluating a DFT in terms of two half-length DFTs 
which are each in turn evaluated by a quarter-length DFT and so on. 


The very important computational complexity theorems of Winograd are stated and briefly 
discussed in Winograd’s Short DFT Algorithms (Chapter 7). The specific details and evaluations
of the Cooley-Tukey FFT and Split-Radix FFT are given in The Cooley-Tukey Fast
Fourier Transform Algorithm (Chapter 9), and PFA and WFTA are covered in The Prime 
Factor and Winograd Fourier Transform Algorithms (Chapter 10). A short discussion of 
high speed convolution is given in Convolution Algorithms (Chapter 13), both for its own 
importance, and its theoretical connection to the DFT. We also present the chirp, Goertzel, 
QFT, NTT, SR-FFT, Approx FFT, Autogen, and programs to implement some of these. 


Ivan Selesnick gives a short introduction in Winograd’s Short DFT Algorithms (Chapter 7)
to using Winograd’s techniques to give a highly structured development of short prime length
FFTs and describes a program that will automatically write these programs. Markus Püschel
presents his “Algebraic Signal Processing” in DFT and FFT: An Algebraic View (Chapter 8)
on describing the various FFT algorithms. And Steven Johnson describes the FFTW (Fastest
Fourier Transform in the West) in Implementing FFTs in Practice (Chapter 11).


The organization of the book represents the various approaches to understanding the FFT 
and to obtaining efficient computer programs. It also shows the intimate relationship between 
theory and implementation that can be used to real advantage. The disparity in material 
devoted to the various approaches represents the tastes of this author, not any intrinsic
differences in value. 


A fairly long list of references is given but it is impossible to be truly complete. I have 
referenced the work that I have used and that I am aware of. The collection of computer 
programs is also somewhat idiosyncratic. They are in Matlab and Fortran because that is 
what I have used over the years. They also are written primarily for their educational value 
although some are quite efficient. There is excellent content in the Connexions book by Doug 
Jones [206].
Chapter 3


Multidimensional Index Mapping


A powerful approach to the development of efficient algorithms is to break a large problem 
into multiple small ones. One method for doing this with both the DFT and convolution 
uses a linear change of index variables to map the original one-dimensional problem into a 
multi-dimensional problem. This approach provides a unified derivation of the Cooley-Tukey 
FFT, the prime factor algorithm (PFA) FFT, and the Winograd Fourier transform algorithm 
(WFTA) FFT. It can also be applied directly to convolution to break it down into multiple 
short convolutions that can be executed faster than a direct implementation. It is often easy 
to translate an algorithm using index mapping into an efficient program. 


The basic definition of the discrete Fourier transform (DFT) is

$$C(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk} \tag{3.1}$$

where $n$, $k$, and $N$ are integers, $j = \sqrt{-1}$, the basis functions are the $N$ roots of unity,

$$W_N = e^{-j 2\pi / N} \tag{3.2}$$

and $k = 0, 1, 2, \cdots, N-1$.


If the $N$ values of the transform are calculated from the $N$ values of the data, $x(n)$, it is
easily seen that $N^2$ complex multiplications and approximately that same number of complex
additions are required. One method for reducing this required arithmetic is to use an index
mapping (a change of variables) to change the one-dimensional DFT into a two- or higher
dimensional DFT. This is one of the ideas behind the very efficient Cooley-Tukey [89] and
Winograd [404] algorithms. The purpose of index mapping is to change a large problem into
several easier ones [46], [120]. This is sometimes called the “divide and conquer” approach [26]
but a more accurate description would be “organize and share” which explains the process
of redundancy removal or reduction.
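As a reference point for the faster algorithms that follow, the direct calculation in (3.1) can be sketched in a few lines of Python. This is only an illustration: the function name `dft_direct` is ours, not the book's, and the double loop is written to make the $N^2$ complex multiplications visible rather than to be efficient.

```python
import cmath

def dft_direct(x):
    """Direct evaluation of C(k) = sum_n x(n) W_N^(n k), W_N = exp(-j 2 pi / N)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    # N outputs, each a sum of N products: N^2 complex multiplications in all.
    return [sum(x[n] * W ** (n * k) for n in range(N)) for k in range(N)]

if __name__ == "__main__":
    # The DFT of a unit impulse is constant.
    print([round(abs(c), 6) for c in dft_direct([1, 0, 0, 0])])  # [1.0, 1.0, 1.0, 1.0]
```

A function like this is useful mainly as a check: any fast algorithm should agree with it to within rounding error.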
3.1 The Index Map


For a length-$N$ sequence, the time index takes on the values

$$n = 0, 1, 2, \ldots, N-1 \tag{3.3}$$

When the length of the DFT is not prime, $N$ can be factored as $N = N_1 N_2$ and two new
independent variables can be defined over the ranges

$$n_1 = 0, 1, 2, \ldots, N_1 - 1 \tag{3.4}$$

$$n_2 = 0, 1, 2, \ldots, N_2 - 1 \tag{3.5}$$

A linear change of variables is defined which maps $n_1$ and $n_2$ to $n$ and is expressed by

$$n = ((K_1 n_1 + K_2 n_2))_N \tag{3.6}$$

where the $K_i$ are integers and the notation $((x))_N$ denotes the integer residue of $x$ modulo
$N$ [232]. This map defines a relation between all possible combinations of $n_1$ and $n_2$ in (3.4)
and (3.5) and the values for $n$ in (3.3). The question as to whether all of the $n$ in (3.3) are
represented, i.e., whether the map is one-to-one (unique), has been answered in [46] showing
that certain integers $K_i$ always exist such that the map in (3.6) is one-to-one. Two cases must
be considered.


3.1.1 Case 1.
$N_1$ and $N_2$ are relatively prime, i.e., the greatest common divisor $(N_1, N_2) = 1$.
The integer map of (3.6) is one-to-one if and only if:

$$(K_1 = aN_2) \ \text{and/or} \ (K_2 = bN_1) \ \text{and} \ (K_1, N_1) = (K_2, N_2) = 1 \tag{3.7}$$

where $a$ and $b$ are integers.


3.1.2 Case 2.
$N_1$ and $N_2$ are not relatively prime, i.e., $(N_1, N_2) > 1$.
The integer map of (3.6) is one-to-one if and only if:

$$(K_1 = aN_2) \ \text{and} \ (K_2 \neq bN_1) \ \text{and} \ (a, N_1) = (K_2, N_2) = 1 \tag{3.8}$$

or

$$(K_1 \neq aN_2) \ \text{and} \ (K_2 = bN_1) \ \text{and} \ (K_1, N_1) = (b, N_2) = 1 \tag{3.9}$$


Reference [46] should be consulted for the details of these conditions and examples. Two 
classes of index maps are defined from these conditions. 
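The uniqueness conditions can be explored numerically. The sketch below is a brute-force check and not the proof given in [46]; the function names are hypothetical. One function tests directly whether a map of the form (3.6) is one-to-one, the other evaluates the Case 1 condition for relatively prime factors so that the two can be compared.

```python
from math import gcd

def is_one_to_one(K1, K2, N1, N2):
    """Brute force: does n = (K1*n1 + K2*n2) mod N reach every n exactly once?"""
    N = N1 * N2
    hit = {(K1 * n1 + K2 * n2) % N for n1 in range(N1) for n2 in range(N2)}
    return len(hit) == N

def case1_condition(K1, K2, N1, N2):
    """Case 1 condition; only meaningful when gcd(N1, N2) == 1."""
    k1_multiple = K1 % N2 == 0   # K1 = a*N2 for some integer a
    k2_multiple = K2 % N1 == 0   # K2 = b*N1 for some integer b
    return (k1_multiple or k2_multiple) and gcd(K1, N1) == 1 and gcd(K2, N2) == 1

if __name__ == "__main__":
    print(is_one_to_one(5, 3, 3, 5))   # n = 5*n1 + 3*n2, N = 15: True
    print(is_one_to_one(5, 1, 3, 5))   # n = 5*n1 + n2,   N = 15: True
    print(is_one_to_one(1, 1, 3, 5))   # neither K is a suitable multiple: False
```

For $N_1 = 3$, $N_2 = 5$ the two functions agree for every pair of constants between 1 and 14, which is a useful sanity check before choosing a map for a program.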
3.1.3 Type-One Index Map:


The map of (3.6) is called a type-one map when integers $a$ and $b$ exist such that

$$K_1 = aN_2 \ \text{and} \ K_2 = bN_1 \tag{3.10}$$


3.1.4 Type-Two Index Map:


The map of (3.6) is called a type-two map when integers $a$ and $b$ exist such that

$$K_1 = aN_2 \ \text{or} \ K_2 = bN_1, \ \text{but not both.} \tag{3.11}$$


The type-one can be used only if the factors of $N$ are relatively prime, but the type-two
can be used whether they are relatively prime or not. Good [149], Thomas, and Winograd
[404] all used the type-one map in their DFT algorithms. Cooley and Tukey [89] used the
type-two in their algorithms, both for a fixed radix ($N = R^M$) and a mixed radix [301].


The frequency index is defined by a map similar to (3.6) as

$$k = ((K_3 k_1 + K_4 k_2))_N \tag{3.12}$$

where the same conditions, (3.7) and (3.8), are used for determining the uniqueness of this
map in terms of the integers $K_3$ and $K_4$.

Two-dimensional arrays for the input data and its DFT are defined using these index maps
to give

$$\hat{x}(n_1, n_2) = x((K_1 n_1 + K_2 n_2))_N \tag{3.13}$$

$$\hat{X}(k_1, k_2) = X((K_3 k_1 + K_4 k_2))_N \tag{3.14}$$

In some of the following equations, the residue reduction notation will be omitted for clarity.


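These changes of variables applied to the definition of the DFT given in (3.1) give

$$C(k) = \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1} x(n)\, W_N^{K_1 K_3 n_1 k_1}\, W_N^{K_1 K_4 n_1 k_2}\, W_N^{K_2 K_3 n_2 k_1}\, W_N^{K_2 K_4 n_2 k_2} \tag{3.15}$$

where $n$ and $k$ are given by the maps (3.6) and (3.12) and all of the exponents are evaluated modulo $N$.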


The amount of arithmetic required to calculate (3.15) is the same as in the direct calculation
of (3.1). However, because of the special nature of the DFT, the integer constants $K_i$ can be
chosen in such a way that the calculations are “uncoupled” and the arithmetic is reduced.
The requirements for this are

$$((K_1 K_4))_N = 0 \ \text{and/or} \ ((K_2 K_3))_N = 0 \tag{3.16}$$

When this condition and those for uniqueness in (3.6) are applied, it is found that the $K_i$
may always be chosen such that one of the terms in (3.16) is zero. If the $N_i$ are relatively
prime, it is always possible to make both terms zero. If the $N_i$ are not relatively prime, only
one of the terms can be set to zero. When they are relatively prime, there is a choice: it is
possible to set either one or both to zero. This in turn causes one or both of the center two
$W$ terms in (3.15) to become unity.


An example of the Cooley-Tukey radix-4 FFT for a length-16 DFT uses the type-two map
with $K_1 = 4$, $K_2 = 1$, $K_3 = 1$, $K_4 = 4$ giving

$$n = 4n_1 + n_2 \tag{3.17}$$

$$k = k_1 + 4k_2 \tag{3.18}$$

The residue reduction in (3.6) is not needed here since $n$ does not exceed $N$ as $n_1$ and $n_2$
take on their values. Since, in this example, the factors of $N$ have a common factor, only
one of the conditions in (3.16) can hold and, therefore, (3.15) becomes

$$\hat{C}(k_1, k_2) = C(k) = \sum_{n_2=0}^{3} \sum_{n_1=0}^{3} x(n)\, W_4^{n_1 k_1}\, W_{16}^{n_2 k_1}\, W_4^{n_2 k_2} \tag{3.19}$$

Note the definition of $W_N$ in (3.2) allows the simple form of $W_{16}^{K_1 K_3} = W_4$.


This has the form of a two-dimensional DFT with an extra term $W_{16}$, called a “twiddle
factor”. The inner sum over $n_1$ represents four length-4 DFTs, the $W_{16}$ term represents 16
complex multiplications, and the outer sum over $n_2$ represents another four length-4 DFTs.
This choice of the $K_i$ “uncouples” the calculations since the first sum over $n_1$ for $n_2 = 0$
calculates the DFT of the first row of the data array $\hat{x}(n_1, n_2)$, and those data values are
never needed in the succeeding row calculations. The row calculations are independent, and
examination of the outer sum shows that the column calculations are likewise independent.
This is illustrated in Figure 3.1.


Figure 3.1: Uncoupling of the Row and Column Calculations (Rectangles are Data Arrays)





The left 4-by-4 array is the mapped input data, the center array has the rows transformed, 
and the right array is the DFT array. The row DFTs and the column DFTs are independent of 
each other. The twiddle factors (TF) which are the center W in (3.19), are the multiplications 
which take place on the center array of Figure 3.1. 
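The length-16 example can be checked with a short Python sketch. It assumes the type-two maps $n = 4n_1 + n_2$ and $k = k_1 + 4k_2$, computes four length-4 DFTs, applies the twiddle factors $W_{16}^{n_2 k_1}$, and then computes four more length-4 DFTs. The helper `dft` and the name `fft16_type_two` are ours, and the short DFTs are done directly for clarity rather than with a multiplication-free length-4 algorithm.

```python
import cmath

def dft(x):
    """Direct DFT, used for the short transforms and as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft16_type_two(x):
    """Length-16 DFT via n = 4*n1 + n2 and k = k1 + 4*k2."""
    assert len(x) == 16
    # Inner sum over n1: one length-4 DFT for each n2, stored as rows[n2][k1].
    rows = [dft([x[4 * n1 + n2] for n1 in range(4)]) for n2 in range(4)]
    # Twiddle factors W_16^(n2*k1) applied to the intermediate array.
    for n2 in range(4):
        for k1 in range(4):
            rows[n2][k1] *= cmath.exp(-2j * cmath.pi * n2 * k1 / 16)
    # Outer sum over n2: one length-4 DFT for each k1.
    C = [0j] * 16
    for k1 in range(4):
        col = dft([rows[n2][k1] for n2 in range(4)])   # indexed by k2
        for k2 in range(4):
            C[k1 + 4 * k2] = col[k2]
    return C

if __name__ == "__main__":
    x = [complex(n % 5, (3 * n) % 7) for n in range(16)]
    err = max(abs(a - b) for a, b in zip(fft16_type_two(x), dft(x)))
    print(err < 1e-9)   # True
```

Each row transform reads and writes only its own four values, which is the uncoupling described in the text.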


This uncoupling feature reduces the amount of arithmetic required and allows the results of 
each row DFT to be written back over the input data locations, since that input row will 
not be needed again. This is called “in-place" calculation and it results in a large memory 
requirement savings. 


An example of the type-two map used when the factors of $N$ are relatively prime is given
for $N = 15$ as

$$n = 5n_1 + n_2 \tag{3.20}$$

$$k = k_1 + 3k_2 \tag{3.21}$$

The residue reduction is again not explicitly needed. Although the factors 3 and 5 are
relatively prime, use of the type-two map sets only one of the terms in (3.16) to zero. The
DFT in (3.15) becomes

$$\hat{X}(k_1, k_2) = \sum_{n_2=0}^{4} \sum_{n_1=0}^{2} \hat{x}(n_1, n_2)\, W_3^{n_1 k_1}\, W_{15}^{n_2 k_1}\, W_5^{n_2 k_2} \tag{3.22}$$

which has the same form as (3.19), including the existence of the twiddle factors (TF). Here
the inner sum is five length-3 DFTs, one for each value of $n_2$. This is illustrated in Figure 3.2
where the rectangles are the 5 by 3 data arrays and the system is called a “mixed radix”
FFT.


Figure 3.2: Uncoupling of the Row and Column Calculations (Rectangles are Data Arrays)





An alternate illustration is shown in Figure 3.3 where the rectangles are the short length-3
and length-5 DFTs.


Figure 3.3: Uncoupling of the Row and Column Calculations (Rectangles are Short DFTs)





The type-one map is illustrated next on the same length-15 example. This time the situation
of (3.7) with the “and” condition is used in (3.10) using an index map of

$$n = 5n_1 + 3n_2 \tag{3.23}$$

and

$$k = 10k_1 + 6k_2 \tag{3.24}$$

The residue reduction is now necessary. Since the factors of $N$ are relatively prime and the
type-one map is being used, both terms in (3.16) are zero, and (3.15) becomes

$$\hat{X}(k_1, k_2) = \sum_{n_2=0}^{4} \sum_{n_1=0}^{2} \hat{x}(n_1, n_2)\, W_3^{n_1 k_1}\, W_5^{n_2 k_2} \tag{3.25}$$

which is similar to (3.22), except that now the type-one map gives a pure two-dimensional
DFT calculation with no TFs, and the sums can be done in either order. Figures 3.2
and 3.3 also describe this case but now there are no twiddle factor multiplications
in the center and the resulting system is called a “prime factor algorithm” (PFA).
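A minimal Python sketch of the length-15 prime factor algorithm is given below. It assumes the input map $n = ((5n_1 + 3n_2))_{15}$ and the output map $k = ((10k_1 + 6k_2))_{15}$, one choice of type-one maps for which both cross terms vanish; the names `dft` and `pfa15` are ours. No twiddle factors appear between the length-3 and length-5 transforms.

```python
import cmath

def dft(x):
    """Direct DFT, used for the short transforms and as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def pfa15(x):
    """Length-15 DFT as a pure 3-by-5 two-dimensional DFT (no twiddle factors)."""
    assert len(x) == 15
    # Five length-3 DFTs over n1, one for each n2; stored as rows[n2][k1].
    rows = [dft([x[(5 * n1 + 3 * n2) % 15] for n1 in range(3)]) for n2 in range(5)]
    X = [0j] * 15
    # Three length-5 DFTs over n2, one for each k1.
    for k1 in range(3):
        col = dft([rows[n2][k1] for n2 in range(5)])   # indexed by k2
        for k2 in range(5):
            X[(10 * k1 + 6 * k2) % 15] = col[k2]
    return X

if __name__ == "__main__":
    x = [complex(n, 2 - n) for n in range(15)]
    err = max(abs(a - b) for a, b in zip(pfa15(x), dft(x)))
    print(err < 1e-9)   # True
```

The residue reduction is essential here: both the input and the output indices wrap around modulo 15.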


The purpose of index mapping is to improve the arithmetic efficiency. For example, a direct
calculation of a length-16 DFT requires $16^2$ or 256 complex multiplications (recall, one complex
multiplication requires 4 real multiplications and 2 real additions) and an uncoupled version
requires 144: eight length-4 DFTs at 16 multiplications each plus 16 TF multiplications.
A direct calculation of a length-15 DFT requires 225 multiplications but with
a type-two map only 135 and with a type-one map, 120.


Algorithms of practical interest use short DFTs that require fewer than $N^2$ multiplications.
For example, length-4 DFTs require no multiplications and, therefore, for the length-16 DFT,
only the TFs must be calculated. That calculation uses 16 multiplications, many fewer than
the 256 or 144 required for the direct or uncoupled calculation.
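The counts quoted above can be reproduced with a small Python sketch. It assumes that a short length-$N_i$ DFT costs $N_i^2$ complex multiplications unless another cost model is supplied, and that a type-two map adds one multiplication per data point for the twiddle factors. The function name and its parameters are illustrative only.

```python
def multiplications(N1, N2, twiddles, short=lambda n: n * n):
    """Complex multiplications for an uncoupled N1-by-N2 DFT.

    N2 length-N1 DFTs plus N1 length-N2 DFTs, plus N twiddle-factor
    multiplications when a type-two map is used.
    """
    N = N1 * N2
    return N2 * short(N1) + N1 * short(N2) + (N if twiddles else 0)

if __name__ == "__main__":
    print(16 ** 2, multiplications(4, 4, True))                   # 256 144
    print(15 ** 2, multiplications(3, 5, True),
          multiplications(3, 5, False))                           # 225 135 120
    print(multiplications(4, 4, True, short=lambda n: 0))         # 16
```

The last line models multiplication-free length-4 DFTs, leaving only the 16 twiddle-factor multiplications.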


The concept of using an index map can also be applied to convolution to convert a length $N = N_1 N_2$
one-dimensional cyclic convolution into a $N_1$ by $N_2$ two-dimensional cyclic convolution
[46], [6]. There is no savings of arithmetic from the mapping alone as there is with the DFT,
but savings can be obtained by using special short algorithms along each dimension. This is
discussed in Algorithms for Data with Restrictions (Chapter 12).


3.2 In-Place Calculation of the DFT and Scrambling 


Because use of both the type-one and type-two index maps uncouples the calculations of the
rows and columns of the data array, the results of each short length-$N_i$ DFT can be written
back over the data as it will not be needed again after that particular row or column is
transformed. This is easily seen from Figures 3.1, 3.2, and 3.3 where
the DFT of the first row of $\hat{x}(n_1, n_2)$ can be put back over the data rather than written into a new
array. After all the calculations are finished, the total DFT is in the array of the original
data. This gives a significant memory savings over using a separate array for the output.


Unfortunately, the use of in-place calculations results in the order of the DFT values being
permuted or scrambled. This is because the data is indexed according to the input map
(3.6) and the results are put into the same locations rather than the locations dictated by
the output map (3.12). For example with a length-8 radix-2 FFT, the input index map is

$$n = 4n_1 + 2n_2 + n_3 \tag{3.26}$$

which to satisfy (3.16) requires an output map of

$$k = k_1 + 2k_2 + 4k_3 \tag{3.27}$$

The in-place calculations will place the DFT results in the locations of the input map
and these should be reordered or unscrambled into the locations given by the output map.
Examination of these two maps shows the scrambled output to be in a “bit reversed” order.


For certain applications, this scrambled output order is not important, but for many applications,
the order must be unscrambled before the DFT can be considered complete. Because
the radix of the radix-2 FFT is the same as the base of the binary number representation, the 
correct address for any term is found by reversing the binary bits of the address. The part 
of most FFT programs that does this reordering is called a bit-reversed counter. Examples 
of various unscramblers are found in [146], [60] and in the appendices. 
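A bit-reversed reordering can be sketched in Python as follows. This is a simple illustration rather than one of the unscramblers from the references: `bit_reverse` reverses the binary digits of an address, and `unscramble` uses it to put an in-place result into natural order. Both names are ours.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` binary digits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # shift the lowest bit of i onto the end of r
        i >>= 1
    return r

def unscramble(c):
    """Reorder a bit-reversed array of length 2**M into natural order."""
    N = len(c)
    bits = N.bit_length() - 1
    return [c[bit_reverse(k, bits)] for k in range(N)]

if __name__ == "__main__":
    print([bit_reverse(i, 3) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```

For the length-8 example, the value with output index $k = k_1 + 2k_2 + 4k_3$ sits at location $4k_1 + 2k_2 + k_3$, which is exactly the bit reversal of $k$.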


The development here uses the input map and the resulting algorithm is called “decimation-in-frequency”.
If the output rather than the input map is used to derive the FFT algorithm
so the correct output order is obtained, the input order must be scrambled so that its values
are in locations specified by the output map rather than the input map. This algorithm
is called “decimation-in-time”. The scrambling is the same bit-reverse counting as before,
but it precedes the FFT algorithm in this case. The same process of a post-unscrambler or 
pre-scrambler occurs for the in-place calculations with the type-one maps. Details can be 
found in [60], [56]. It is possible to do the unscrambling while calculating the FFT and to 
avoid a separate unscrambler. This is done for the Cooley-Tukey FFT in [192] and for the 
PFA in [60], [56], [319]. 


If a radix-2 FFT is used, the unscrambler is a bit-reversed counter. If a radix-4 FFT is used,
the unscrambler is a base-4 reversed counter, and similarly for radix-8 and others. However,
if for the radix-4 FFT, the short length-4 DFTs (butterflies) have their outputs in bit-reversed
order, the output of the total radix-4 FFT will be in bit-reversed order, not base-4 reversed
order. This means any FFT whose radix is a power of two can use the same radix-2 bit-reversed
counter as an unscrambler if the proper butterflies are used.


3.3 Efficiencies Resulting from Index Mapping with the 
DFT 


In this section the reductions in arithmetic in the DFT that result from the index mapping
alone will be examined. In practical algorithms several methods are always combined, but
it is helpful in understanding the effects of a particular method to study it alone.
The most general form of an uncoupled two-dimensional DFT is given by

$$\hat{X}(k_1, k_2) = \sum_{n_2=0}^{N_2-1} \left\{ \sum_{n_1=0}^{N_1-1} \hat{x}(n_1, n_2)\, f_1(n_1, n_2, k_1) \right\} f_2(n_2, k_1, k_2) \tag{3.28}$$

where the inner sum calculates $N_2$ length-$N_1$ DFTs and, for a type-two map, the effects
of the TFs. If the number of arithmetic operations for a length-$N$ DFT is denoted by $F(N)$,
the number of operations for this inner sum is $F = N_2 F(N_1)$. The outer sum which gives $N_1$
length-$N_2$ DFTs requires $N_1 F(N_2)$ operations. The total number of arithmetic operations
is then

$$F = N_2 F(N_1) + N_1 F(N_2) \tag{3.29}$$

The first question to be considered is, for a fixed length $N$, what is the optimal relation of
$N_1$ and $N_2$ in the sense of minimizing the required amount of arithmetic. To answer this
question, $N_1$ and $N_2$ are temporarily assumed to be real variables rather than integers. If
the short length-$N_i$ DFTs in (3.28) and any TF multiplications are assumed to require $N_i^2$
operations, i.e. $F(N_i) = N_i^2$, (3.29) becomes

$$F = N_2 N_1^2 + N_1 N_2^2 = N (N_1 + N_2) = N (N_1 + N N_1^{-1}) \tag{3.30}$$

To find the minimum of $F$ over $N_1$, the derivative of $F$ with respect to $N_1$ is set to zero
(temporarily assuming the variables to be continuous) and the result requires $N_1 = N_2$.

$$\frac{dF}{dN_1} = N \left( 1 - N N_1^{-2} \right) = 0 \ \Rightarrow \ N_1 = N_2 \tag{3.31}$$


This result is also easily seen from the symmetry of $N_1$ and $N_2$ in $N = N_1 N_2$. If a more
general model of the arithmetic complexity of the short DFTs is used, the same result
is obtained, but a closer examination must be made to assure that $N_1 = N_2$ is a global
minimum.


If only the effects of the index mapping are to be considered, then the $F(N) = N^2$ model is
used and (3.31) states that the two factors should be equal. If there are $M$ factors, a similar
reasoning shows that all $M$ factors should be equal. For the sequence of length

$$N = R^M \tag{3.32}$$

there are now $M$ length-$R$ DFTs and, since the factors are all equal, the index map must
be type two. This means there must be twiddle factors.


In order to simplify the analysis, only the number of multiplications will be considered. If
the number of multiplications for a length-$R$ DFT is $F(R)$, then the formula for operation
counts in (3.30) generalizes to

$$F = N \sum_{i=1}^{M} F(N_i) / N_i = N M F(R) / R \tag{3.33}$$

for $N_i = R$:

$$F = N \log_R(N)\, F(R) / R = (N \ln N) \left( F(R) / (R \ln R) \right) \tag{3.34}$$

This is a very important formula which was derived by Cooley and Tukey in their famous
paper [89] on the FFT. It states that for a given $R$ which is called the radix, the number of
multiplications (and additions) is proportional to $N \ln N$. It also shows the relation to the
value of the radix, $R$.


In order to get some idea of the “best” radix, the number of multiplications to compute a
length-$R$ DFT is assumed to be $F(R) = R^x$. If this is used with (3.34), the optimal $R$ can
be found.

$$dF/dR = 0 \ \Rightarrow \ R = e^{1/(x-1)} \tag{3.35}$$

For $x = 2$ this gives $R = e$, with the closest integer being three.


The result of this analysis states that if no arithmetic saving methods other than index
mapping are used, and if the length-$R$ DFTs plus TFs require $F = R^2$ multiplications, the
optimal algorithm requires

$$F = 3N \log_3 N \tag{3.36}$$

multiplications for a length $N = 3^M$ DFT. Compare this with $N^2$ for a direct calculation
and the improvement is obvious.
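The dependence on the radix can be tabulated with a short Python sketch. It evaluates the factor $F(R)/(R \ln R)$ that multiplies $N \ln N$, under the same assumed model $F(R) = R^x$; the function name `cost_per_NlnN` is hypothetical.

```python
import math

def cost_per_NlnN(R, x=2.0):
    """F(R) / (R ln R) with the assumed short-DFT model F(R) = R**x."""
    return R ** x / (R * math.log(R))

if __name__ == "__main__":
    for R in (2, 3, 4, 8):
        print(R, round(cost_per_NlnN(R), 3))
    # The best integer radix under this model:
    print(min(range(2, 17), key=cost_per_NlnN))   # 3
```

Under this model radix 2 and radix 4 give the same count, and radix 3 is slightly better than both, in agreement with the continuous optimum at $R = e$.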


While this is an interesting result from the analysis of the effects of index mapping alone, in
practice, index mapping is almost always used in conjunction with special algorithms for the
short length-$N_i$ DFTs in (3.15). For example, if $R = 2$ or 4, there are no multiplications
required for the short DFTs. Only the TFs require multiplications. Winograd (see Winograd’s
Short DFT Algorithms (Chapter 7)) has derived some algorithms for short DFTs that
require $O(N)$ multiplications. This means that $F(N_i) = K N_i$ and the operation count
$F$ in (3.33) is independent of $N_i$. Therefore, the derivative
of $F$ is zero for all $N_i$. Obviously, these particular cases must be examined.


3.4 The FFT as a Recursive Evaluation of the DFT 


It is possible to formulate the DFT so a length-$N$ DFT can be calculated in terms of two
length-($N/2$) DFTs. And, if $N = 2^M$, each of those length-($N/2$) DFTs can be found in
terms of length-($N/4$) DFTs. This allows the DFT to be calculated by a recursive algorithm
with $M$ recursions, giving the familiar order $N \log(N)$ arithmetic complexity.


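Calculate the even indexed DFT values from (3.1) by:

$$C(2k) = \sum_{n=0}^{N-1} x(n)\, W_N^{2nk} = \sum_{n=0}^{N-1} x(n)\, W_{N/2}^{nk} \tag{3.37}$$

$$C(2k) = \sum_{n=0}^{N/2-1} x(n)\, W_{N/2}^{nk} + \sum_{n=N/2}^{N-1} x(n)\, W_{N/2}^{nk} \tag{3.38}$$

$$C(2k) = \sum_{n=0}^{N/2-1} \{x(n) + x(n + N/2)\}\, W_{N/2}^{nk} \tag{3.39}$$

and a similar argument gives the odd indexed values as:

$$C(2k+1) = \sum_{n=0}^{N/2-1} \{x(n) - x(n + N/2)\}\, W_N^{n}\, W_{N/2}^{nk} \tag{3.40}$$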


Together, these are recursive DFT formulas expressing the length-$N$ DFT of $x(n)$ in terms
of length-$N/2$ DFTs:

$$C(2k) = \mathrm{DFT}_{N/2}\{x(n) + x(n + N/2)\} \tag{3.41}$$

$$C(2k+1) = \mathrm{DFT}_{N/2}\{[x(n) - x(n + N/2)]\, W_N^n\} \tag{3.42}$$

This is a “decimation-in-frequency” (DIF) version since it gives samples of the frequency
domain representation in terms of blocks of the time domain signal.

A recursive Matlab program which implements this is given by:
function c = dftr2(x)
% Recursive Decimation-in-Frequency FFT algorithm, csb 8/21/07
% x must be a row vector whose length is a power of two; c is returned as a column.
L = length(x);
if L > 1
    L2 = L/2;
    TF = exp(-j*2*pi/L).^[0:L2-1];             % twiddle factors W_L^n
    c1 = dftr2( x(1:L2) + x(L2+1:L));           % even-indexed outputs, (3.41)
    c2 = dftr2((x(1:L2) - x(L2+1:L)).*TF);      % odd-indexed outputs, (3.42)
    cc = [c1.'; c2.'];                          % .' transposes without conjugating
    c = cc(:);                                  % interleave even and odd outputs
else
    c = x;
end


Listing 3.1: DIF Recursive FFT for $N = 2^M$





A DIT version can be derived in the form:

$$C(k) = \mathrm{DFT}_{N/2}\{x(2n)\} + W_N^k\, \mathrm{DFT}_{N/2}\{x(2n+1)\}$$

$$C(k + N/2) = \mathrm{DFT}_{N/2}\{x(2n)\} - W_N^k\, \mathrm{DFT}_{N/2}\{x(2n+1)\}$$

which gives blocks of the frequency domain from samples of the signal.


A recursive Matlab program which implements this is given by:
function c = dftr(x)
% Recursive Decimation-in-Time FFT algorithm, csb
% x must be a row vector whose length is a power of two.
L = length(x);
if L > 1
    L2 = L/2;
    ce = dftr(x(1:2:L-1));                  % DFT of the even-indexed samples x(2n)
    co = dftr(x(2:2:L));                    % DFT of the odd-indexed samples x(2n+1)
    TF = exp(-j*2*pi/L).^[0:L2-1];          % twiddle factors W_L^k
    c1 = TF.*co;
    c = [(ce+c1), (ce-c1)];
else
    c = x;
end


Listing 3.2: DIT Recursive FFT for $N = 2^M$





Similar recursive expressions can be developed for other radices and algorithms. Most
recursive programs do not execute as efficiently as looped or straight code, but some can be
very efficient, e.g. parts of the FFTW.


Note a length-$2^M$ sequence will require $M$ recursions, each of which will require $N/2$
multiplications. This gives the $N \log(N)$ formula that the other approaches also derive.
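The multiplication count can be confirmed with a small Python sketch of the decimation-in-time recursion that tallies one multiplication per twiddle factor, including the trivial ones. The function name `fft_dit` and the one-element list used as a counter are our own devices for illustration.

```python
import cmath

def fft_dit(x, count):
    """Recursive radix-2 DIT FFT; count[0] accumulates twiddle multiplications."""
    N = len(x)
    if N == 1:
        return list(x)
    ce = fft_dit(x[0::2], count)      # DFT of the even-indexed samples
    co = fft_dit(x[1::2], count)      # DFT of the odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * co[k]   # one multiplication
        count[0] += 1
        out[k] = ce[k] + t
        out[k + N // 2] = ce[k] - t
    return out

if __name__ == "__main__":
    for M in range(1, 6):
        count = [0]
        fft_dit([1.0] * 2 ** M, count)
        print(2 ** M, count[0])       # second column equals (N/2) * M
```

For $N = 16$ the count is 32, that is, $M = 4$ levels of $N/2 = 8$ multiplications each.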
Chapter 4


Polynomial Description of Signals


Polynomials are important in digital signal processing because calculating the DFT can be 
viewed as a polynomial evaluation problem and convolution can be viewed as polynomial 
multiplication [27], [261]. Indeed, this is the basis for the important results of Winograd 
discussed in Winograd’s Short DFT Algorithms (Chapter 7). A length-$N$ signal $x(n)$ will
be represented by an $N-1$ degree polynomial $X(s)$ defined by

$$X(s) = \sum_{n=0}^{N-1} x(n)\, s^n \tag{4.1}$$


This polynomial X (s) is a single entity with the coefficients being the values of x(n). It 
is somewhat similar to the use of matrix or vector notation to efficiently represent signals 
which allows use of new mathematical tools. 


The convolution of two finite length sequences, $x(n)$ and $h(n)$, gives an output sequence
defined by

$$y(n) = \sum_{k=0}^{N-1} x(k)\, h(n-k) \tag{4.2}$$

$n = 0, 1, 2, \cdots, 2N-1$ where $h(k) = 0$ for $k < 0$. This is exactly the same operation as
calculating the coefficients when multiplying two polynomials. Equation (4.2) is the same as

$$Y(s) = X(s)\, H(s) \tag{4.3}$$


In fact, convolution of number sequences, multiplication of polynomials, and the multiplication
of integers (except for the carry operation) are all the same operations. To obtain
cyclic convolution, where the indices in (4.2) are all evaluated modulo N, the polynomial
multiplication in (4.3) is done modulo the polynomial $P(s) = s^N - 1$. This is seen by noting
that N = 0 mod N, therefore, $s^N = 1$ and the polynomial modulus is $s^N - 1$.
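The equivalence can be illustrated with a short Python sketch. The function names and the sample data below are hypothetical and chosen only for this illustration: polynomial multiplication produces the linear convolution, and folding the coefficients with $s^N = 1$ produces the cyclic convolution.

```python
def poly_mul(x, h):
    """Coefficients of X(s)H(s); identical to the linear convolution of x and h."""
    y = [0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y

def reduce_mod_sN_minus_1(y, N):
    """Reduce Y(s) modulo s^N - 1: since s^N = 1, coefficient n folds onto n mod N."""
    r = [0] * N
    for n, yn in enumerate(y):
        r[n % N] += yn
    return r

def cyclic_conv(x, h):
    """Cyclic convolution computed directly from (4.2) with indices modulo N."""
    N = len(x)
    return [sum(x[k] * h[(n - k) % N] for k in range(N)) for n in range(N)]

x = [1, 2, 3, 4]
h = [1, 0, -1, 2]
linear = poly_mul(x, h)                    # 2N - 1 = 7 coefficients
cyclic = reduce_mod_sN_minus_1(linear, 4)  # 4 coefficients
```

Here `cyclic` agrees with `cyclic_conv(x, h)`, which evaluates the index arithmetic modulo N directly.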
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4.1 Polynomial Reduction and the Chinese Remainder 
Theorem 


Residue reduction of one polynomial modulo another is defined similarly to residue reduction
for integers. A polynomial F(s) has a residue polynomial R(s) modulo P(s) if, for a given
F(s) and P(s), a Q(s) and R(s) exist such that


$$F(s) = Q(s)\, P(s) + R(s) \tag{4.4}$$
with degree{R(s)} < degree{P(s)}. The notation that will be used is
$$R(s) = ((F(s)))_{P(s)} \tag{4.5}$$
For example, since $s^4 + s^3 = (s^2 + s + 1)(s^2 - 1) + (s + 1)$,
$$(s + 1) = ((s^4 + s^3))_{(s^2 - 1)} \tag{4.6}$$


The concepts of factoring a polynomial and of primeness are an extension of these ideas 
for integers. For a given allowed set of coefficients (values of x(n)), any polynomial has a 
unique factored representation 


$$F(s) = \prod_{i} F_i(s)^{k_i} \tag{4.7}$$
where the $F_i(s)$ are relatively prime. This is analogous to the fundamental theorem of
arithmetic. 


There is a very useful operation that is an extension of the integer Chinese Remainder 
Theorem (CRT) which says that if the modulus polynomial can be factored into relatively 
prime factors 


$$P(s) = P_1(s)\, P_2(s) \tag{4.8}$$


then there exist two polynomials, $K_1(s)$ and $K_2(s)$, such that any polynomial F(s) can be
recovered from its residues by


$$F(s) = K_1(s)\, F_1(s) + K_2(s)\, F_2(s) \mod P(s) \tag{4.9}$$


where $F_1$ and $F_2$ are the residues given by


$$F_1(s) = ((F(s)))_{P_1(s)} \tag{4.10}$$
and
$$F_2(s) = ((F(s)))_{P_2(s)} \tag{4.11}$$
if the degree of F(s) is less than that of P(s). This generalizes to any number of relatively
prime factors of P(s) and can be viewed as a means of representing F(s) by several lower degree
polynomials, $F_i(s)$.


This decomposition of F'(s) into lower degree polynomials is the process used to break a 
DFT or convolution into several simple problems which are solved and then recombined using 
the CRT of (4.9). This is another form of the “divide and conquer" or “organize and share" 
approach similar to the index mappings in Multidimensional Index Mapping (Chapter 3). 
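As a small worked illustration of the CRT (an example constructed here, not taken from the references), take $P(s) = s^2 - 1$ with $P_1(s) = s - 1$ and $P_2(s) = s + 1$. One valid choice of the coefficient polynomials is shown below, and a first-degree $F(s) = a + bs$ is recovered from its two constant residues.

```latex
\begin{align*}
K_1(s) &= \tfrac{1}{2}(s + 1), & K_1 &\equiv 1 \bmod P_1, & K_1 &\equiv 0 \bmod P_2,\\
K_2(s) &= \tfrac{1}{2}(1 - s), & K_2 &\equiv 0 \bmod P_1, & K_2 &\equiv 1 \bmod P_2,\\
F_1 &= ((a + bs))_{s-1} = a + b, & F_2 &= ((a + bs))_{s+1} = a - b,\\
K_1 F_1 + K_2 F_2 &= \tfrac{1}{2}(s+1)(a+b) + \tfrac{1}{2}(1-s)(a-b) = a + bs .
\end{align*}
```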


One useful property of the CRT is for convolution. If cyclic convolution of x(n) and h (n) is 
expressed in terms of polynomials by 


Y (s) = H(s)X (s) mod P(s) (4.12) 


where $P(s) = s^N - 1$, and if P(s) is factored into two relatively prime factors $P = P_1 P_2$,
using residue reduction of H(s) and X(s) modulo $P_1$ and $P_2$, the lower degree residue
polynomials can be multiplied and the results recombined with the CRT. This is done by


$$Y(s) = ((K_1 H_1 X_1 + K_2 H_2 X_2))_{P} \tag{4.13}$$
where


$$H_1 = ((H))_{P_1}, \quad X_1 = ((X))_{P_1}, \quad H_2 = ((H))_{P_2}, \quad X_2 = ((X))_{P_2} \tag{4.14}$$


and $K_1$ and $K_2$ are the CRT coefficient polynomials from (4.9). This allows two shorter
convolutions to replace one longer one.
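A minimal Python sketch of (4.13) for N = 4 follows. It assumes the factoring $s^4 - 1 = (s^2 - 1)(s^2 + 1)$ and the CRT polynomials $K_1 = (s^2 + 1)/2$ and $K_2 = (1 - s^2)/2$. These values and the helper names are choices made for this illustration, not taken from the references.

```python
def poly_mul(a, b):
    """Product of two polynomials given as coefficient lists (lowest power first)."""
    y = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

def reduce_mod_s2(f, sign):
    """Residue of F(s) modulo s^2 - sign (sign is +1 or -1), using s^2 = sign."""
    r = [0, 0]
    for n, c in enumerate(f):
        r[n % 2] += c * sign ** (n // 2)
    return r

def reduce_mod_s4_minus_1(f):
    """Residue of F(s) modulo s^4 - 1, using s^4 = 1."""
    r = [0, 0, 0, 0]
    for n, c in enumerate(f):
        r[n % 4] += c
    return r

K1 = [0.5, 0, 0.5]     # (s^2 + 1)/2: equals 1 mod (s^2 - 1) and 0 mod (s^2 + 1)
K2 = [0.5, 0, -0.5]    # (1 - s^2)/2: equals 0 mod (s^2 - 1) and 1 mod (s^2 + 1)

def cyclic_conv4_crt(x, h):
    # Two short products of the residue polynomials, one per factor
    y1 = reduce_mod_s2(poly_mul(reduce_mod_s2(h, 1), reduce_mod_s2(x, 1)), 1)
    y2 = reduce_mod_s2(poly_mul(reduce_mod_s2(h, -1), reduce_mod_s2(x, -1)), -1)
    # CRT recombination modulo s^4 - 1
    t1 = poly_mul(K1, y1)
    t2 = poly_mul(K2, y2)
    return reduce_mod_s4_minus_1([a + b for a, b in zip(t1, t2)])

y = cyclic_conv4_crt([1, 2, 3, 4], [1, 0, -1, 2])
```

The result matches the length-4 cyclic convolution of the two sequences computed directly.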


Another property of residue reduction that is useful in DFT calculation is polynomial
evaluation. To evaluate F(s) at s = x, F(s) is reduced modulo s - x.


$$F(x) = ((F(s)))_{s-x} \tag{4.15}$$
This is easily seen from the definition in (4.4)


$$F(s) = Q(s)\,(s - x) + R(s) \tag{4.16}$$
Evaluating at s = x gives R(s) = F(x), which is a constant. For the DFT this becomes


$$C(k) = ((X(s)))_{s - W^k} \tag{4.17}$$
Details of the polynomial algebra useful in digital signal processing can be found in [27], 
[233], [261]. 
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4.2 The DFT as a Polynomial Evaluation 
The Z-transform of a number sequence x(n) is defined as
$$X(z) = \sum_{n=0}^{\infty} x(n)\, z^{-n} \tag{4.18}$$


which is the same as the polynomial description in (4.1) but with a negative exponent. For
a finite length-N sequence (4.18) becomes


$$X(z) = \sum_{n=0}^{N-1} x(n)\, z^{-n} \tag{4.19}$$
$$X(z) = x(0) + x(1)\, z^{-1} + x(2)\, z^{-2} + \cdots + x(N-1)\, z^{-N+1} \tag{4.20}$$


This N - 1 order polynomial takes on the values of the DFT of x(n) when evaluated at


$$z = e^{j2\pi k/N} \tag{4.21}$$


which gives


$$C(k) = X(z)\big|_{z = e^{j2\pi k/N}} = \sum_{n=0}^{N-1} x(n)\, e^{-j2\pi nk/N} \tag{4.22}$$
In terms of the positive exponent polynomial from (4.1), the DFT is


$$C(k) = X(s)\big|_{s = W^k} \tag{4.23}$$


where


$$W = e^{-j2\pi/N} \tag{4.24}$$


is an N-th root of unity (raising W to the N-th power gives one). The N values of the DFT
are found from X(s) evaluated at the N N-th roots of unity which are equally spaced around
the unit circle in the complex s plane.


One method of evaluating X (z) is the so-called Horner’s rule or nested evaluation. When 
expressed as a recursive calculation, Horner’s rule becomes the Goertzel algorithm which has 
some computational advantages especially when only a few values of the DFT are needed. 
The details and programs can be found in [272], [61] and in The DFT as Convolution or
Filtering: Goertzel's Algorithm (or A Better DFT Algorithm) (Section 5.3).


Another method for evaluating X(s) is the residue reduction modulo $(s - W^k)$ as shown in
(4.17). Each evaluation requires N multiplications and therefore, $N^2$ multiplications for the
N values of C(k).


$$C(k) = ((X(s)))_{(s - W^k)} \tag{4.25}$$
A considerable reduction in required arithmetic can be achieved if some operations can be
shared between the reductions for different values of k. This is done by carrying out the
residue reduction in stages that can be shared rather than done in one step for each k in
(4.25).


The N values of the DFT are values of X(s) evaluated at s equal to the N roots of the
polynomial $P(s) = s^N - 1$ which are $W^k$. First, assuming N is even, factor P(s) as


$$P(s) = (s^N - 1) = P_1(s)\, P_2(s) = (s^{N/2} - 1)(s^{N/2} + 1) \tag{4.26}$$

X(s) is reduced modulo these two factors to give two residue polynomials, $X_1(s)$ and
$X_2(s)$. This process is repeated by factoring $P_1$ and further reducing $X_1$, then factoring
$P_2$ and reducing $X_2$. This is continued until the factors are of first degree which gives the
desired DFT values as in (4.25). This is illustrated for a length-8 DFT. The polynomial whose
roots are $W^k$ factors as


$$P(s) = s^8 - 1 \tag{4.27}$$

$$= [s^4 - 1]\,[s^4 + 1] \tag{4.28}$$

$$= [(s^2 - 1)(s^2 + 1)]\,[(s^2 - j)(s^2 + j)] \tag{4.29}$$

$$= [(s - 1)(s + 1)(s - j)(s + j)]\,[(s - a)(s + a)(s - ja)(s + ja)] \tag{4.30}$$


where $a^2 = j$. Reducing X(s) by the first factoring gives two third degree polynomials.
The polynomial


$$X(s) = x_0 + x_1 s + x_2 s^2 + \cdots + x_7 s^7 \tag{4.31}$$
gives the residue polynomials


$$X_1(s) = ((X(s)))_{(s^4 - 1)} = (x_0 + x_4) + (x_1 + x_5)\, s + (x_2 + x_6)\, s^2 + (x_3 + x_7)\, s^3 \tag{4.32}$$

$$X_2(s) = ((X(s)))_{(s^4 + 1)} = (x_0 - x_4) + (x_1 - x_5)\, s + (x_2 - x_6)\, s^2 + (x_3 - x_7)\, s^3 \tag{4.33}$$
Two more levels of reduction are carried out to finally give the DFT. Close examination 
shows the resulting algorithm to be the decimation-in-frequency radix-2 Cooley-Tukey FFT 
[272], [61]. Martens [227] has used this approach to derive an efficient DFT algorithm. 
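The staged reduction can be sketched in a few lines of Python. This is an illustrative sketch under the assumption that N is a power of two, and it is not the program of [227]. The names `reduce_tree` and `dft_index` are hypothetical, and the code multiplies by the root r directly at every node instead of arranging the twiddle factors the way an optimized decimation-in-frequency program would.

```python
import cmath

def reduce_tree(coeffs, c):
    """coeffs is the residue of X(s) modulo (s^L - c), with L = len(coeffs).
    Returns (root, X(root)) pairs for the L roots of s^L = c."""
    L = len(coeffs)
    if L == 1:
        return [(c, coeffs[0])]        # residue modulo (s - c) is the value X(c)
    half = L // 2
    r = cmath.sqrt(c)                  # s^L - c = (s^half - r)(s^half + r)
    lo, hi = coeffs[:half], coeffs[half:]
    plus = [lo[n] + r * hi[n] for n in range(half)]    # modulo s^half - r
    minus = [lo[n] - r * hi[n] for n in range(half)]   # modulo s^half + r
    return reduce_tree(plus, r) + reduce_tree(minus, -r)

def dft_index(root, N):
    """Index k such that root = W^k with W = exp(-j 2 pi / N)."""
    return round(-cmath.phase(root) * N / (2 * cmath.pi)) % N

x = [1, 2, 0, -1, 3, 0.5, -2, 4]
pairs = reduce_tree(x, 1)              # X(s) is its own residue modulo s^8 - 1
C = [0] * 8
for root, value in pairs:
    C[dft_index(root, 8)] = value
```

The first call reproduces (4.32) and (4.33): with c = 1 the two residues are the sums and differences of the low and high halves of the coefficient list.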


Other algorithms and types of FFT can be developed using polynomial representations and 
some are presented in the generalization in DFT and FFT: An Algebraic View (Chapter 8). 
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Chapter 5 


The DFT as Convolution or Filtering’


A major application of the FFT is fast convolution or fast filtering where the DFT of the
signal is multiplied term-by-term by the DFT of the impulse response (it helps to be doing
finite impulse response (FIR) filtering) and the time-domain output is obtained by taking the
inverse DFT of that product. What is less well-known is that the DFT can be calculated by
convolution. There are several different approaches to this, each with different applications.


5.1 Rader’s Conversion of the DFT into Convolution 


In this section a method quite different from the index mapping or polynomial evaluation 
is developed. Rather than dealing with the DFT directly, it is converted into a cyclic 
convolution which must then be carried out by some efficient means. Those means will 
be covered later, but here the conversion will be explained. This method requires use of 
some number theory, which can be found in an accessible form in [234] or [262] and is easy 
enough to verify on one’s own. A good general reference on number theory is [259]. 


The DFT and cyclic convolution are defined by 


$$C(k) = \sum_{n=0}^{N-1} x(n)\, W^{nk} \tag{5.1}$$


$$y(k) = \sum_{n=0}^{N-1} x(n)\, h(k-n) \tag{5.2}$$


For both, the indices are evaluated modulo N. In order to convert the DFT in (5.1) into the 
cyclic convolution of (5.2), the nk product must be changed to the k — n difference. With 
real numbers, this can be done with logarithms, but it is more complicated when working 
in a finite set of integers modulo N. From number theory [28], [234], [262], [259], it can be
shown that if the modulus is a prime number, a base (called a primitive root) exists such
that a form of integer logarithm can be defined. This is stated in the following way. If N is
a prime number, a number r called a primitive root exists such that the integer equation


$$n = ((r^{m}))_N \tag{5.3}$$
creates a unique, one-to-one map of the N - 1 member set m = {0, ..., N - 2} and the N - 1
member set n = {1, ..., N - 1}. This is because the multiplicative group of integers modulo
a prime, p, is isomorphic to the additive group of integers modulo (p - 1) and is illustrated
for N = 5 below.

Table 5.1: Table of Integers $n = ((r^m))$ modulo 5, [* not defined]


Table 5.1 is an array of values of $r^m$ modulo N and it is easy to see that there are two
primitive roots, 2 and 3, and (5.3) defines a permutation of the integers n from the integers
m (except for zero). (5.3) and a primitive root (usually chosen to be the smallest of those
that exist) can be used to convert the DFT in (5.1) to the convolution in (5.2). Since (5.3)
cannot give a zero, a new length-(N-1) data sequence is defined from x(n) by removing the
term with index zero. Let


$$n = ((r^{-m}))_N \tag{5.4}$$
and
$$k = ((r^{s}))_N \tag{5.5}$$


where the term with the negative exponent (the inverse) is defined as the integer that
satisfies


$$((r^{-m}\, r^{m}))_N = 1 \tag{5.6}$$

If N is a prime number, $r^{-m}$ always exists. For example, $((2^{-1}))_5 = 3$. (5.1) now becomes

$$C(r^{s}) = \sum_{m=0}^{N-2} x(r^{-m})\, W^{r^{-m} r^{s}} + x(0), \tag{5.7}$$
for s = 0, 1, ..., N - 2, and


$$C(0) = \sum_{n=0}^{N-1} x(n) \tag{5.8}$$


New functions are defined, which are simply a permutation in the order of the original
functions, as


$$x'(m) = x(r^{-m}), \quad C'(s) = C(r^{s}), \quad W'(n) = W^{r^{n}} \tag{5.9}$$
(5.7) then becomes
$$C'(s) = \sum_{m=0}^{N-2} x'(m)\, W'(s-m) + x(0) \tag{5.10}$$
which is cyclic convolution of length N-1 (plus x(0)) and is denoted as
$$C'(k) = x'(k) * W'(k) + x(0) \tag{5.11}$$


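Applying this change of variables (use of logarithms) to the DFT can best be illustrated
from the matrix formulation of the DFT. (5.1) is written for a length-5 DFT as


$$\begin{bmatrix} C(0) \\ C(1) \\ C(2) \\ C(3) \\ C(4) \end{bmatrix} =
\begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 2 & 3 & 4 \\ 0 & 2 & 4 & 1 & 3 \\ 0 & 3 & 1 & 4 & 2 \\ 0 & 4 & 3 & 2 & 1 \end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \end{bmatrix} \tag{5.12}$$


where the square matrix should contain the terms of $W^{nk}$ but for clarity, only the exponents
nk are shown. Separating the x(0) term, applying the mapping of (5.9), and using the
primitive root r = 2 (and $r^{-1} = 3$) gives


$$\begin{bmatrix} C(1) \\ C(2) \\ C(4) \\ C(3) \end{bmatrix} =
\begin{bmatrix} 1 & 3 & 4 & 2 \\ 2 & 1 & 3 & 4 \\ 4 & 2 & 1 & 3 \\ 3 & 4 & 2 & 1 \end{bmatrix}
\begin{bmatrix} x(1) \\ x(3) \\ x(4) \\ x(2) \end{bmatrix} +
\begin{bmatrix} x(0) \\ x(0) \\ x(0) \\ x(0) \end{bmatrix} \tag{5.13}$$
and
$$C(0) = x(0) + x(1) + x(2) + x(3) + x(4) \tag{5.14}$$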


which can be seen to be a reordering of the structure in (5.12). This is in the form of cyclic 
convolution as indicated in (5.10). Rader first showed this in 1968 [234], stating that a prime 
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length-N DFT could be converted into a length-(N-1) cyclic convolution of a permutation 
of the data with a permutation of the W’s. He also stated that a slightly more complicated 
version of the same idea would work for a DFT with a length equal to an odd prime to a 
power. The details of that theory can be found in [234], [169]. 
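The conversion can be tried out with the following Python sketch for a prime length. It is an illustration only: the brute-force `primitive_root` search and the direct evaluation of the length-(N-1) cyclic convolution are assumptions made here for brevity, whereas a practical program would carry out that convolution by one of the efficient methods discussed later.

```python
import cmath

def primitive_root(N):
    """Smallest primitive root of the odd prime N, found by brute force."""
    for r in range(2, N):
        if len({pow(r, m, N) for m in range(N - 1)}) == N - 1:
            return r
    raise ValueError("no primitive root found; N must be an odd prime")

def rader_dft(x):
    """DFT of a prime-length sequence via a length-(N-1) cyclic convolution."""
    N = len(x)
    r = primitive_root(N)
    r_inv = pow(r, N - 2, N)                 # inverse of r modulo the prime N
    W = cmath.exp(-2j * cmath.pi / N)
    xp = [x[pow(r_inv, m, N)] for m in range(N - 1)]   # x'(m) = x(r^-m)
    Wp = [W ** pow(r, n, N) for n in range(N - 1)]     # W'(n) = W^(r^n)
    C = [0] * N
    C[0] = sum(x)                                       # equation (5.8)
    for s in range(N - 1):
        # cyclic convolution of x' with W' (indices modulo N-1), plus x(0)
        acc = sum(xp[m] * Wp[(s - m) % (N - 1)] for m in range(N - 1))
        C[pow(r, s, N)] = acc + x[0]                    # C'(s) = C(r^s)
    return C

x = [1, 2, 3, 4, 5]
C = rader_dft(x)
```

For N = 5 the search returns r = 2, and the permuted input order `[pow(3, m, 5) for m in range(4)]` is 1, 3, 4, 2, as in (5.13).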


Until 1976, this conversion approach received little attention since it seemed to offer few 
advantages. It has specialized applications in calculating the DFT if the cyclic convolution 
is done by distributed arithmetic table look-up [77] or by use of number theoretic transforms 
[28], [234], [262]. It and the Goertzel algorithm [273], [62] are efficient when only a few DFT 
values need to be calculated. It may also have advantages when used with pipelined or vector 
hardware designed for fast inner products. One example is the TMS320 signal processing 
microprocessor which is pipelined for inner products. The general use of this scheme emerged 
when new fast cyclic convolution algorithms were developed by Winograd [405]. 


5.2 The Chirp Z-Transform (or Bluestein’s Algorithm) 


The DFT of x(n) evaluates the Z-transform of x(n) on N equally spaced points on the unit
circle in the z plane. Using a nonlinear change of variables, one can create a structure which
is equivalent to modulation and filtering x(n) by a "chirp" signal [34], [306], [298], [273],
[304], [62].


The mathematical identity $(k - n)^2 = k^2 - 2kn + n^2$ gives


$$nk = \left(n^2 - (k - n)^2 + k^2\right)/2 \tag{5.15}$$


which substituted into the definition of the DFT in Multidimensional Index Mapping:
Equation 1 (3.1) gives


$$C(k) = \left\{ \sum_{n=0}^{N-1} \left[ x(n)\, W^{n^2/2} \right] W^{-(k-n)^2/2} \right\} W^{k^2/2} \tag{5.16}$$
This equation can be interpreted as first multiplying (modulating) the data x(n) by a chirp
sequence $W^{n^2/2}$, then convolving (filtering) it, then finally multiplying the filter output by
the chirp sequence to give the DFT.


Define the chirp sequence or signal as $h(n) = W^{n^2/2}$ which is called a chirp because the
squared exponent gives a sinusoid with changing frequency. Using this definition, (5.16)
becomes


$$C(n) = \left\{ [x(n)\, h(n)] * h^{-1}(n) \right\} h(n) \tag{5.17}$$


We know that convolution can be carried out by multiplying the DFTs of the signals; here
we see that evaluation of the DFT can be carried out by convolution. Indeed, the convolution
represented by * in (5.17) can be carried out by DFTs (actually FFTs) of a larger length.
This allows a prime length DFT to be calculated by a very efficient length-2^M FFT. This
becomes practical for large N when a particular non-composite (or N with few factors)
length is required.


As developed here, the chirp z-transform evaluates the z-transform at equally spaced points
on the unit circle. A slight modification allows evaluation on a spiral and in segments [298],
[273] and allows savings when only some input values are nonzero or when only some output
values are needed. The story of the development of this transform is given in [304].


Two Matlab programs to calculate an arbitrary length DFT using the chirp z-transform are
shown in Figure 5.1.

function y = chirpc(x);
% function y = chirpc(x)
% computes an arbitrary-length DFT with the
% chirp z-transform algorithm. csb. 6/12/91
%
N = length(x); n = 0:N-1;               %Sequence length
W = exp(-j*pi*n.*n/N);                  %Chirp signal
xw = x.*W;                              %Modulate with chirp
WW = [conj(W(N:-1:2)),conj(W)];         %Construct filter
y = conv(WW,xw);                        %Convolve w filter
y = y(N:2*N-1).*W;                      %Demodulate w chirp

function y = chirp(x);
% function y = chirp(x)
% computes an arbitrary-length Discrete Fourier Transform (DFT)
% with the chirp z transform algorithm. The linear convolution
% then required is done with FFTs.
% 1988: L. Arevalo; 11.06.91 K. Schwarz, LNT Erlangen; 6/12/91 csb.
%
N = length(x);                          %Sequence length
L = 2^ceil(log((2*N-1))/log(2));        %FFT length
n = 0:N-1;
W = exp(-j*pi*n.*n/N);                  %Chirp signal
FW = fft([conj(W), zeros(1,L-2*N+1), conj(W(N:-1:2))],L);
y = ifft(FW.*fft(x.'.*W,L));            %Convolve using FFT
y = y(1:N).*W;                          %Demodulate

Figure 5.1





5.3 Goertzel’s Algorithm (or A Better DFT Algorithm) 


Goertzel's algorithm [144], [62], [269] is another method that calculates the DFT by
converting it into a digital filtering problem. The method looks at the calculation of the DFT
as the evaluation of a polynomial on the unit circle in the complex plane. This evaluation is
done by Horner's method which is implemented recursively by an IIR filter.
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5.3.1 The First-Order Goertzel Algorithm 


The polynomial whose values on the unit circle are the DFT is a slightly modified z-transform
of x(n) given by


$$X(z) = \sum_{n=0}^{N-1} x(n)\, z^{n} \tag{5.18}$$


which for clarity in this development uses a positive exponent. This is illustrated for a
length-4 sequence as a third-order polynomial by


$$X(z) = x(3)\, z^3 + x(2)\, z^2 + x(1)\, z + x(0) \tag{5.19}$$
The DFT is found by evaluating (5.18) at $z = W^k$, which can be written as
$$C(k) = X(z)\big|_{z = W^k} = \mathrm{DFT}\{x(n)\} \tag{5.20}$$
where
$$W = e^{-j2\pi/N} \tag{5.21}$$
The most efficient way of evaluating a general polynomial without any pre-processing is by
"Horner's rule" [208] which is a nested evaluation. This is illustrated for the polynomial in
(5.19) by
$$X(z) = \{[x(3)\, z + x(2)]\, z + x(1)\}\, z + x(0) \tag{5.22}$$
This nested sequence of operations can be written as a linear difference equation in the form
of
$$y(m) = z\, y(m-1) + x(N-m) \tag{5.23}$$


with initial condition y(0) = 0, and the desired result being the solution at m = N. The
value of the polynomial is given by


$$X(z) = y(N) \tag{5.24}$$


(5.23) can be viewed as a first-order IIR filter with the input being the data sequence in
reverse order and the value of the polynomial at z being the filter output sampled at m = N.
Applying this to the DFT gives the Goertzel algorithm [283], [269] which is


$$y(m) = W^{k}\, y(m-1) + x(N-m) \tag{5.25}$$
with y(0) = 0 and


$$C(k) = y(N) \tag{5.26}$$
where


$$C(k) = \sum_{n=0}^{N-1} x(n)\, W^{nk}. \tag{5.27}$$


The flowgraph of the algorithm can be found in [62], [269] and a simple FORTRAN program 
is given in the appendix. 


When comparing this program with the direct calculation of (5.27), it is seen that the number 
of floating-point multiplications and additions are the same. In fact, the structures of the 
two algorithms look similar, but close examination shows that the way the sines and cosines 
enter the calculations is different. In (5.27), new sine and cosine values are calculated for 
each frequency and for each data value, while for the Goertzel algorithm in (5.25), they are 
calculated only for each frequency in the outer loop. Because of the recursive or feedback
nature of the algorithm, the sine and cosine values are "updated" each loop rather than
recalculated. This results in 2N trigonometric evaluations rather than $2N^2$. It also results
in an increase in accumulated quantization error.
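The recursion (5.25) is short enough to state as a Python sketch. This is an illustration written for this section, not the FORTRAN program of the appendix, and the function name is hypothetical. Note that the complex exponential is evaluated once per frequency k, outside the data loop.

```python
import cmath

def goertzel_first_order(x, k):
    """C(k) from y(m) = W^k y(m-1) + x(N-m), with y(0) = 0 and C(k) = y(N)."""
    N = len(x)
    z = cmath.exp(-2j * cmath.pi * k / N)   # W^k, one trigonometric evaluation pair per k
    y = 0
    for m in range(1, N + 1):
        y = z * y + x[N - m]                # the data enter in reverse order
    return y

x = [1, 2, 0, -1, 3, 0.5, -2]
C = [goertzel_first_order(x, k) for k in range(len(x))]
```

Any length works, including the prime length 7 used here, since nothing in the recursion depends on factoring N.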


It is possible to modify this algorithm to allow entering the data in forward order rather
than reverse order. The difference equation (5.23) becomes


$$y(m) = z^{-1}\, y(m-1) + x(m-1) \tag{5.28}$$
if (5.24) becomes
$$C(k) = z^{N-1}\, y(N) \tag{5.29}$$
for y(0) = 0. This is the algorithm programmed later.


5.3.2 The Second-Order Goertzel Algorithm 


One of the reasons the first-order Goertzel algorithm does not improve efficiency is that
the constant in the feedback or recursive path is complex and, therefore, requires four real
multiplications and two real additions. A modification of the scheme to make it second-order
removes the complex multiplications and reduces the number of required multiplications by
a factor of two.


Define the variable q(m) so that


$$y(m) = q(m) - z^{-1}\, q(m-1). \tag{5.30}$$
This substituted into the right-hand side of (5.23) gives


$$y(m) = z\, q(m-1) - q(m-2) + x(N-m). \tag{5.31}$$
Combining (5.30) and (5.31) gives the second order difference equation


$$q(m) = (z + z^{-1})\, q(m-1) - q(m-2) + x(N-m) \tag{5.32}$$
which together with the output (5.30), comprise the second-order Goertzel algorithm where
$$X(z) = y(N) \tag{5.33}$$

for initial conditions q(0) = q(-1) = 0.


A similar development starting with (5.28) gives a second-order algorithm with forward
ordered input as


$$q(m) = (z + z^{-1})\, q(m-1) - q(m-2) + x(m-1) \tag{5.34}$$
$$y(m) = q(m) - z\, q(m-1) \tag{5.35}$$

with
$$X(z) = z^{N-1}\, y(N) \tag{5.36}$$


and for q(0) = q(-1) = 0.


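Note that both difference equations (5.32) and (5.34) are not changed if z is replaced with
$z^{-1}$; only the output equations (5.30) and (5.35) are different. This means that the
polynomial X(z) may be evaluated at a particular z and its inverse $z^{-1}$ from one solution
of the difference equation (5.32) or (5.34). For the reverse-order form (5.32), the output
equations are


$$X(z) = q(N) - z^{-1}\, q(N-1) \tag{5.37}$$


and


$$X(1/z) = q(N) - z\, q(N-1). \tag{5.38}$$


For the forward-order form (5.34), the corresponding outputs are
$X(z) = z^{N-1}(q(N) - z\, q(N-1))$ and $X(1/z) = z^{-(N-1)}(q(N) - z^{-1} q(N-1))$.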


Clearly, this allows the DFT of a sequence to be calculated with half the arithmetic since
the outputs are calculated two at a time. The second-order DE actually produces a solution
q(m) that contains two first-order components. The output equations are, in effect, zeros
that cancel one or the other pole of the second-order solution to give the desired first-order
solution. In addition to allowing the calculation of two outputs at a time, the second-order
DE requires half the number of real multiplications as the first-order form. This is because
the coefficient of q(m - 2) is unity and the coefficient of q(m - 1) is real if z and
$z^{-1}$ are complex conjugates of each other, which is true for the DFT.
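A minimal Python sketch of the two-at-a-time idea follows, assuming the reverse-order recursion (5.32). For that recursion the two outputs are $q(N) - z^{-1} q(N-1)$ for X(z) and $q(N) - z\, q(N-1)$ for X(1/z). With $z = W^k$ these are C(k) and C(N - k). The function name is hypothetical.

```python
import cmath
import math

def goertzel_second_order(x, k):
    """Return (C(k), C(N-k)) from a single pass of the second-order recursion."""
    N = len(x)
    theta = 2 * math.pi * k / N
    coeff = 2 * math.cos(theta)        # z + 1/z, real because |z| = 1
    z = cmath.exp(-1j * theta)         # z = W^k
    q1 = 0                             # holds q(m-1)
    q2 = 0                             # holds q(m-2)
    for m in range(1, N + 1):
        q0 = coeff * q1 - q2 + x[N - m]    # one real multiplication per real sample
        q2, q1 = q1, q0
    # after the loop q1 = q(N) and q2 = q(N-1)
    c_k = q1 - q2 / z                  # X(z)   = q(N) - z^-1 q(N-1)
    c_nk = q1 - z * q2                 # X(1/z) = q(N) - z q(N-1)
    return c_k, c_nk

x = [1, 2, 0, -1, 3, 0.5, -2]
c2, c5 = goertzel_second_order(x, 2)   # C(2) and C(5) for N = 7
```

For real data the state q(m) stays real, so the only complex arithmetic is in the two output equations.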
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5.3.3 Analysis of Arithmetic Complexity and Timings 


Analysis of the various forms of the Goertzel algorithm from their programs gives the
following operation count for real multiplications and real additions assuming real data.


Algorithm        | Real Mults. | Real Adds  | Trig Eval.
Direct DFT       | 4N^2        | 4N^2       | 2N^2
First-Order      | 4N^2        | 4N^2 - 2N  | 2N
Second-Order     | 2N^2 + 2N   | 4N^2       | 2N
Second-Order 2   | N^2 + N     | 2N^2 + N   | N

Table 5.2


Timings of the algorithms on a PC in milliseconds are given in the following table.


Algorithm        | N = 125 | N = 257
Direct DFT       | 4.90    | 19.83
First-Order      | 4.01    | 16.70
Second-Order     | 2.64    | 11.04
Second-Order 2   | 1.32    | 5.55

Table 5.3


These timings track the floating point operation counts fairly well. 


5.3.4 Conclusions 


Goertzel’s algorithm in its first-order form is not particularly interesting, but the two-at-a- 
time second-order form is significantly faster than a direct DFT. It can also be used for any 
polynomial evaluation or for the DTFT at unequally spaced values or for evaluating a few 
DFT terms. A very interesting observation is that the inner-most loop of the Glassman- 
Ferguson FFT [124] is a first-order Goertzel algorithm even though that FFT is developed 
in a very different framework. 


In addition to floating-point arithmetic counts, the number of trigonometric function
evaluations that must be made or the size of a table to store precomputed values should be
considered. Since the values of the $W^{nk}$ terms in (5.23) are iteratively calculated in the IIR
filter structure, there is round-off error accumulation that should be analyzed in any
application.
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It may be possible to further improve the efficiency of the second-order Goertzel algorithm 
for calculating all of the DFT of a number sequence. Perhaps a fourth order DE could 
calculate four output values at a time and they could be separated by a numerator that 
would cancel three of the zeros. Perhaps the algorithm could be arranged in stages to give 
an N log(N) operation count. The current algorithm does not take into account any of the 
symmetries of the input index. Perhaps some of the ideas used in developing the QFT [53], 
[155], [158] could be used here. 


5.4 The Quick Fourier Transform (QFT) 


One stage of the QFT can use the symmetries of the sines and cosines to calculate a DFT
more efficiently than directly implementing the definition in Multidimensional Index Mapping:
Equation 1 (3.1). Similar to the Goertzel algorithm, the one-stage QFT is a better $N^2$ DFT
algorithm for arbitrary lengths. See The Cooley-Tukey Fast Fourier Transform Algorithm:
The Quick Fourier Transform, An FFT based on Symmetries (Section 9.4).
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Chapter 6 


Factoring the Signal Processing 
Operators’ 


A third approach to removing redundancy in an algorithm is to express the algorithm as
an operator and then factor that operator into sparse factors. This approach is used by
Tolimieri [382], [384], Egner [118], Selesnick, Elliott [121] and others. It is presented in a
more general form in DFT and FFT: An Algebraic View (Chapter 8). The operators may be
in the form of a matrix or a tensor operator.


6.1 The FFT from Factoring the DFT Operator 


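The definition of the DFT in Multidimensional Index Mapping: Equation 1 (3.1) can be written
as a matrix-vector operation by C = WX which, for N = 8 is


$$\begin{bmatrix} C(0) \\ C(1) \\ C(2) \\ C(3) \\ C(4) \\ C(5) \\ C(6) \\ C(7) \end{bmatrix} =
\begin{bmatrix}
W^{0} & W^{0} & W^{0} & W^{0} & W^{0} & W^{0} & W^{0} & W^{0} \\
W^{0} & W^{1} & W^{2} & W^{3} & W^{4} & W^{5} & W^{6} & W^{7} \\
W^{0} & W^{2} & W^{4} & W^{6} & W^{8} & W^{10} & W^{12} & W^{14} \\
W^{0} & W^{3} & W^{6} & W^{9} & W^{12} & W^{15} & W^{18} & W^{21} \\
W^{0} & W^{4} & W^{8} & W^{12} & W^{16} & W^{20} & W^{24} & W^{28} \\
W^{0} & W^{5} & W^{10} & W^{15} & W^{20} & W^{25} & W^{30} & W^{35} \\
W^{0} & W^{6} & W^{12} & W^{18} & W^{24} & W^{30} & W^{36} & W^{42} \\
W^{0} & W^{7} & W^{14} & W^{21} & W^{28} & W^{35} & W^{42} & W^{49}
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \tag{6.1}$$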

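which clearly requires $N^2 = 64$ complex multiplications and N(N - 1) additions. A
factorization of the DFT operator, W, gives $W = F_1 F_2 F_3$ and $C = F_1 F_2 F_3 X$ or, expanded,

$$\begin{bmatrix} C(0) \\ C(4) \\ C(2) \\ C(6) \\ C(1) \\ C(5) \\ C(3) \\ C(7) \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
W^{0} & 0 & -W^{0} & 0 & 0 & 0 & 0 & 0 \\
0 & W^{2} & 0 & -W^{2} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & W^{0} & 0 & -W^{0} & 0 \\
0 & 0 & 0 & 0 & 0 & W^{2} & 0 & -W^{2}
\end{bmatrix} \tag{6.2}$$

$$\times
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
W^{0} & 0 & 0 & 0 & -W^{0} & 0 & 0 & 0 \\
0 & W^{1} & 0 & 0 & 0 & -W^{1} & 0 & 0 \\
0 & 0 & W^{2} & 0 & 0 & 0 & -W^{2} & 0 \\
0 & 0 & 0 & W^{3} & 0 & 0 & 0 & -W^{3}
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \tag{6.3}$$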


where the $F_i$ matrices are sparse. Note that each has 16 (or 2N) non-zero terms and $F_2$ and
$F_3$ have 8 (or N) non-unity terms. If $N = 2^M$, then the number of factors is $\log_2(N) = M$. In
another form with the twiddle factors separated so as to count the complex multiplications
we have


$$\begin{bmatrix} C(0) \\ C(4) \\ C(2) \\ C(6) \\ C(1) \\ C(5) \\ C(3) \\ C(7) \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1
\end{bmatrix} \tag{6.4}$$

$$\times\ \mathrm{diag}\left(1, 1, W^{0}, W^{2}, 1, 1, W^{0}, W^{2}\right)
\begin{bmatrix}
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & -1
\end{bmatrix} \tag{6.5}$$

$$\times\ \mathrm{diag}\left(1, 1, 1, 1, W^{0}, W^{1}, W^{2}, W^{3}\right)
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & -1
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \tag{6.6}$$


which is in the form $C = A_1 M_1 A_2 M_2 A_3 X$ described by the index map. $A_1$, $A_2$, and
$A_3$ each represents 8 additions, or, in general, N additions. $M_1$ and $M_2$ each represent 4 (or
N/2) multiplications.
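The factorization can be checked numerically. The Python sketch below is an illustration under the assumption N = 8. It builds the three sparse stage matrices from a hypothetical helper `stage`, multiplies them, and compares the product with the DFT matrix whose rows are taken in the bit-reversed order 0, 4, 2, 6, 1, 5, 3, 7.

```python
import cmath

N = 8
W = cmath.exp(-2j * cmath.pi / N)

def matmul(A, B):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def stage(block):
    """One decimation-in-frequency stage acting on independent blocks of the given length."""
    F = [[0] * N for _ in range(N)]
    half = block // 2
    for start in range(0, N, block):
        for n in range(half):
            tw = W ** (n * N // block)     # twiddle factor for this block length
            top = start + n
            bot = start + n + half
            F[top][top] = 1                # sum branch
            F[top][bot] = 1
            F[bot][top] = tw               # difference branch times the twiddle factor
            F[bot][bot] = -tw
    return F

F3 = stage(8)
F2 = stage(4)
F1 = stage(2)
product = matmul(F1, matmul(F2, F3))
order = [0, 4, 2, 6, 1, 5, 3, 7]           # bit-reversed order of the outputs C(k)
```

Each stage matrix has two non-zero entries per row, 16 in all, so applying the three factors in sequence costs far fewer operations than the 64-entry dense matrix.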


This is a very interesting result showing that implementing the DFT using the factored form
requires considerably less arithmetic than the single factor definition. Indeed, the form of
the formula that Cooley and Tukey derived showing that the amount of arithmetic required
by the FFT is on the order of N log(N) can be seen from the factored operator formulation.


Much of the theory of the FFT can be developed using operator factoring and it has some
advantages for implementation on parallel and vector computer architectures. The eigenspace
approach is somewhat of the same type [18].


6.2 Algebraic Theory of Signal Processing Algorithms 


A very general structure for all kinds of algorithms can be obtained from the approach
of operators and operator decomposition. This is developed as "Algebraic Theory of Signal
Processing" discussed in the module DFT and FFT: An Algebraic View (Chapter 8) by
Püschel and others [118].
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Chapter 7 
Winograd's Short DFT Algorithms¹


In 1976, S. Winograd [406] presented a new DFT algorithm which had significantly fewer
multiplications than the Cooley-Tukey FFT which had been published eleven years earlier.
This new Winograd Fourier Transform Algorithm (WFTA) is based on the type-one index
map from Multidimensional Index Mapping (Chapter 3) with each of the relatively prime
length short DFT's calculated by very efficient special algorithms. It is these short
algorithms that this section will develop. They use the index permutation of Rader described
in another module to convert the prime length short DFT's into cyclic convolutions.
Winograd developed a method for calculating digital convolution with the minimum number
of multiplications. These optimal algorithms are based on the polynomial residue reduction
techniques of Polynomial Description of Signals: Equation 1 (4.1) to break the convolution
into multiple small ones [29], [235], [263], [416], [408], [197].


The operation of discrete convolution defined by
$$y(n) = \sum_{k} h(n-k)\, x(k) \tag{7.1}$$


is called a bilinear operation because, for a fixed h(n), y(n) is a linear function of x (n) 
and for a fixed x(n) it is a linear function of h(n). The operation of cyclic convolution is 
the same but with all indices evaluated modulo N. 


Recall from Polynomial Description of Signals: Equation 3 (4.3) that length-N cyclic convo- 
lution of x(n) and h(n) can be represented by polynomial multiplication 


$$Y(s) = X(s)\, H(s) \mod (s^N - 1) \tag{7.2}$$


¹This content is available online at <http://cnx.org/content/m16333/1.14/>.


This bilinear operation of (7.1) and (7.2) can also be expressed in terms of linear matrix
operators and a simpler bilinear operator denoted by $\circ$ which may be only a simple
element-by-element multiplication of the two vectors [235], [197], [212]. This matrix formulation
is


$$Y = C\,[AX \circ BH] \tag{7.3}$$


where X, H and Y are length-N vectors with elements of x(n), h(n) and y(n) respectively.
The matrices A and B have dimension M x N, and C is N x M with M ≥ N. The
elements of A, B, and C are constrained to be simple; typically small integers or rational
numbers. It will be these matrix operators that do the equivalent of the residue reduction
on the polynomials in (7.2).


In order to derive a useful algorithm of the form (7.3) to calculate (7.1), consider the
polynomial formulation (7.2) again. To use the residue reduction scheme, the modulus is factored
into relatively prime factors. Fortunately the factoring of this particular polynomial, $s^N - 1$,
has been extensively studied and it has considerable structure. When factored over the
rationals, which means that the only coefficients allowed are rational numbers, the factors
are called cyclotomic polynomials [29], [235], [263]. The most interesting property for our
purposes is that most of the coefficients of cyclotomic polynomials are zero and the others
are plus or minus unity for degrees up to over one hundred. This means the residue reduction
will generally require no multiplications.


The operations of reducing X(s) and H(s) in (7.2) are carried out by the matrices A and
B in (7.3). The convolution of the residue polynomials is carried out by the $\circ$ operator and
the recombination by the CRT is done by the C matrix. More details are in [29], [235],
[263], [197], [212] but the important fact is the A and B matrices usually contain only zero
and plus or minus unity entries and the C matrix only contains rational numbers. The only
general multiplications are those represented by $\circ$. Indeed, in the theoretical results from
computational complexity theory, these real or complex multiplications are usually the only
ones counted. In practical algorithms, the rational multiplications represented by C could
be a limiting factor.


The h(n) terms are fixed for a digital filter, or they represent the W terms from Multidi- 
mensional Index Mapping: Equation 1 (3.1) if the convolution is being used to calculate a 
DFT. Because of this, d = BH in (7.3) can be precalculated and only the A and C operators 
represent the mathematics done at execution of the algorithm. In order to exploit this 
feature, it was shown [416], [197] that the properties of (7.3) allow the exchange of the more 
complicated operator C with the simpler operator B. Specifically this is given by 


$$Y = C\,[AX \circ BH] \tag{7.4}$$
$$Y' = B^T [AX \circ C^T H'] \tag{7.5}$$

where H' has the same elements as H, but in a permuted order, and likewise Y' and Y. This 
very important property allows precomputing the more complicated C^T H' in (7.5) rather 
than BH as in (7.3). 
Because BH or C^T H' can be precomputed, the bilinear form of (7.3) and (7.5) can be written 
as a linear form. If an M x M diagonal matrix D is formed from d = C^T H', or in the case 
of (7.3), d = BH, assuming a commutative property for o, (7.5) becomes 


$$Y' = B^T D A X \tag{7.6}$$
and (7.3) becomes 


$$Y = C D A X \tag{7.7}$$


In most cases there is no reason not to use the same reduction operations on X and H, 
therefore, B can be the same as A and (7.6) then becomes 


$$Y' = A^T D A X \tag{7.8}$$
In order to illustrate how the residue reduction is carried out and how the A matrix is 
obtained, the length-5 DFT algorithm started in The DFT as Convolution or Filtering: 
Matrix 1 (5.12) will be continued. The DFT is first converted to a length-4 cyclic convolution 
by the index permutation from The DFT as Convolution or Filtering: Equation 3 (5.3) to 
give the cyclic convolution in The DFT as Convolution or Filtering (Chapter 5). To avoid 
confusion from the permuted order of the data x(n) in The DFT as Convolution or Filtering 
(Chapter 5), the cyclic convolution will first be developed without the permutation, using 
the polynomial U (s) 


$$U(s) = x(1) + x(3)s + x(4)s^2 + x(2)s^3 \tag{7.9}$$

$$U(s) = u(0) + u(1)s + u(2)s^2 + u(3)s^3 \tag{7.10}$$
and then the results will be converted back to the permuted x(n). The length-4 cyclic 
convolution in terms of polynomials is 


$$Y(s) = U(s)\,H(s) \bmod (s^4 - 1) \tag{7.11}$$
and the modulus factors into three cyclotomic polynomials 
$$s^4 - 1 = (s^2 - 1)(s^2 + 1) \tag{7.12}$$
$$= (s - 1)(s + 1)(s^2 + 1) \tag{7.13}$$
$$= P_1 P_2 P_3 \tag{7.14}$$


Both U(s) and H(s) are reduced modulo these three polynomials. The reduction modulo 
P_1 and P_2 is done in two stages. First it is done modulo (s^2 - 1), then that residue is further 
reduced modulo (s - 1) and (s + 1). 

$$U(s) = u_0 + u_1 s + u_2 s^2 + u_3 s^3 \tag{7.15}$$
$$U'(s) = ((U(s)))_{(s^2-1)} = (u_0 + u_2) + (u_1 + u_3)s \tag{7.16}$$
$$U_1(s) = ((U'(s)))_{P_1} = (u_0 + u_1 + u_2 + u_3) \tag{7.17}$$
$$U_2(s) = ((U'(s)))_{P_2} = (u_0 - u_1 + u_2 - u_3) \tag{7.18}$$
$$U_3(s) = ((U(s)))_{P_3} = (u_0 - u_2) + (u_1 - u_3)s \tag{7.19}$$


The reduction in (7.16) of the data polynomial (7.15) can be denoted by a matrix operation 
on a vector which has the data as entries. 











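$$\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} u_0 + u_2 \\ u_1 + u_3 \end{bmatrix} \tag{7.20}$$
and the reduction in (7.19) is 
$$\begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} u_0 - u_2 \\ u_1 - u_3 \end{bmatrix} \tag{7.21}$$
Combining (7.20) and (7.21) gives one operator 
$$\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} u_0 + u_2 \\ u_1 + u_3 \\ u_0 - u_2 \\ u_1 - u_3 \end{bmatrix} = \begin{bmatrix} w_0 \\ w_1 \\ v_0 \\ v_1 \end{bmatrix} \tag{7.22}$$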


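Further reduction of v_0 + v_1 s is not possible because P_3 = s^2 + 1 cannot be factored over 
the rationals. However s^2 - 1 can be factored into P_1 P_2 = (s - 1)(s + 1) and, therefore, 
w_0 + w_1 s can be further reduced as was done in (7.17) and (7.18) by 


$$\begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = w_0 + w_1 = u_0 + u_2 + u_1 + u_3 \tag{7.23}$$
$$\begin{bmatrix} 1 & -1 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = w_0 - w_1 = u_0 + u_2 - u_1 - u_3 \tag{7.24}$$


Combining (7.22), (7.23) and (7.24) gives 


$$\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} r_0 \\ r_1 \\ v_0 \\ v_1 \end{bmatrix} \tag{7.25}$$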
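
The two-stage reduction of (7.22) through (7.25) can be sketched in a few lines of Python. This is only an illustration: the function names `reduce_length4` and `evaluate` are hypothetical, and the check relies on the standard fact that a residue modulo (s - a) equals the polynomial evaluated at s = a (and the residue modulo s^2 + 1 agrees with the polynomial at s = j).

```python
def reduce_length4(u):
    """Reduce u0 + u1 s + u2 s^2 + u3 s^3 modulo s - 1, s + 1 and s^2 + 1."""
    u0, u1, u2, u3 = u
    # First stage, as in (7.22): residues modulo s^2 - 1 and s^2 + 1
    w0, w1 = u0 + u2, u1 + u3      # modulo s^2 - 1
    v0, v1 = u0 - u2, u1 - u3      # modulo s^2 + 1
    # Second stage, as in (7.23) and (7.24): split w0 + w1 s further
    r0 = w0 + w1                   # modulo s - 1
    r1 = w0 - w1                   # modulo s + 1
    return [r0, r1, v0, v1]


def evaluate(u, s):
    """Evaluate the polynomial with coefficients u (lowest power first) at s."""
    return sum(c * s**k for k, c in enumerate(u))


print(reduce_length4([3, 1, 4, 1]))  # [9, 5, -1, 0]
```

Only additions and subtractions appear, six in total for the two stages, which is the point of choosing cyclotomic factors with coefficients of zero and plus or minus one.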


The same reduction is done to H (s) and then the convolution of (7.11) is done by multiplying 
each residue polynomial of X (s) and H (s) modulo each corresponding cyclotomic factor of 
P(s) and finally a recombination using the polynomial Chinese Remainder Theorem (CRT) 
as in Polynomial Description of Signals: Equation 9 (4.9) and Polynomial Description of 
Signals: Equation 13 (4.13). 


$$Y(s) = K_1(s)U_1(s)H_1(s) + K_2(s)U_2(s)H_2(s) + K_3(s)U_3(s)H_3(s) \bmod (s^4 - 1) \tag{7.26}$$
where U_1(s) = r_0 and U_2(s) = r_1 are constants and U_3(s) = v_0 + v_1 s is a first degree 
polynomial. U_1 times H_1 and U_2 times H_2 are easy, but multiplying U_3 times H_3 modulo 
(s^2 + 1) is more difficult. 


The multiplication of U3(s) times H3(s) can be done by the Toom-Cook algorithm [29], 
[235], [263] which can be viewed as Lagrange interpolation or polynomial multiplication 
modulo a special polynomial with three arbitrary coefficients. To simplify the arithmetic, 
the constants are chosen to be plus and minus one and zero. The details of this can be found 
in [29], [235], [263]. For this example it can be verified that 


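$$((\,(v_0 + v_1 s)(h_0 + h_1 s)\,))_{s^2+1} = (v_0 h_0 - v_1 h_1) + (v_0 h_1 + v_1 h_0)\,s \tag{7.27}$$


which by the Toom-Cook algorithm or inspection is 


$$\begin{bmatrix} 1 & -1 & 0 \\ -1 & -1 & 1 \end{bmatrix} \left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} v_0 \\ v_1 \end{bmatrix} \circ \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} h_0 \\ h_1 \end{bmatrix} \right) = \begin{bmatrix} y_0 \\ y_1 \end{bmatrix} \tag{7.28}$$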
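
As a quick check of (7.27), the product modulo s^2 + 1 can be formed with three general multiplications instead of four. The following Python sketch is illustrative only; the function names are hypothetical, and the recombination step is derived directly from (7.27) rather than quoted from the text.

```python
def mult_mod_s2_plus_1(v, h):
    """(v0 + v1 s)(h0 + h1 s) modulo s^2 + 1 using three multiplications."""
    v0, v1 = v
    h0, h1 = h
    # Expansion: each 2-vector becomes 3 values at the cost of one addition
    va = [v0, v1, v0 + v1]
    ha = [h0, h1, h0 + h1]
    # The only general multiplications (the "o" operation)
    m = [a * b for a, b in zip(va, ha)]
    # Recombination with additions only
    y0 = m[0] - m[1]               # v0 h0 - v1 h1
    y1 = m[2] - m[0] - m[1]        # v0 h1 + v1 h0
    return [y0, y1]


def mult_mod_s2_plus_1_direct(v, h):
    """Reference form of (7.27) with four multiplications."""
    return [v[0] * h[0] - v[1] * h[1], v[0] * h[1] + v[1] * h[0]]


print(mult_mod_s2_plus_1([2, 3], [5, 7]))  # [-11, 29]
```

The saving of one multiplication is paid for with extra additions, which is the trade-off that becomes unfavorable for higher degree residue polynomials.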


where o signifies point-by-point multiplication. The total A matrix in (7.3) is a combination 
of (7.25) and (7.28) giving 


$$AX = A_1 A_2 A_3 X \tag{7.29}$$

$$= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} r_0 \\ r_1 \\ v_0 \\ v_1 \\ v_0 + v_1 \end{bmatrix} \tag{7.30}$$


where the matrix A_3 gives the residue reduction modulo s^2 - 1 and s^2 + 1, the upper left-hand part 
of A_2 gives the reduction modulo s - 1 and s + 1, and the lower right-hand part of A_1 carries 
out the Toom-Cook algorithm modulo s^2 + 1 with the multiplication in (7.5). Notice that 
by calculating (7.30) in the three stages, seven additions are required. Also notice that A_1 
is not square. It is this "expansion" that causes more than N multiplications to be required 
in o in (7.5) or D in (7.6). This staged reduction will derive the A operator for (7.5). 
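
To see the whole bilinear form at work, the sketch below computes a length-4 cyclic convolution with five general multiplications: the seven-addition expansion of (7.30), five point-by-point products, and a CRT recombination. It is a minimal illustration, not the program from the references. The function names are hypothetical, and the CRT polynomials K_1 = (1 + s + s^2 + s^3)/4, K_2 = (1 - s + s^2 - s^3)/4 and K_3 = (1 - s^2)/2 were worked out for this example rather than taken from the text.

```python
from fractions import Fraction


def expand(u):
    """Apply A = A1 A2 A3: four inputs to five values using seven additions."""
    u0, u1, u2, u3 = u
    w0, w1, v0, v1 = u0 + u2, u1 + u3, u0 - u2, u1 - u3   # A3
    r0, r1 = w0 + w1, w0 - w1                               # A2
    return [r0, r1, v0, v1, v0 + v1]                        # A1


def cyclic_conv4(x, h):
    """Length-4 cyclic convolution via Y = C [AX o BH] with B = A."""
    m = [a * b for a, b in zip(expand(x), expand(h))]       # five multiplications
    a = Fraction(m[0]) / 4                  # residue product modulo s - 1, times 1/4
    b = Fraction(m[1]) / 4                  # residue product modulo s + 1, times 1/4
    c0 = Fraction(m[2] - m[3]) / 2          # constant term of the s^2 + 1 product, halved
    c1 = Fraction(m[4] - m[2] - m[3]) / 2   # s term of the s^2 + 1 product, halved
    # CRT recombination: coefficients of K1*R0 + K2*R1 + K3*(y0 + y1 s) mod s^4 - 1
    return [a + b + c0, a - b + c1, a + b - c0, a - b - c1]


def cyclic_conv_direct(x, h):
    """Reference: y(n) = sum over k of x(k) h((n - k) mod N)."""
    n = len(x)
    return [sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)]


print(cyclic_conv4([1, 2, 3, 4], [1, 0, 2, 1]) == cyclic_conv_direct([1, 2, 3, 4], [1, 0, 2, 1]))
```

When h is fixed, the expansion of h and the divisions by 4 and 2 can be folded into precomputed constants, leaving only the additions on x and the five multiplications at run time.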


The method described above is very straightforward for the shorter DFT lengths. For 
N = 3, both of the residue polynomials are constants and the multiplication given by 
o in (7.3) is trivial. For N = 5, which is the example used here, there is one first degree 
polynomial multiplication required but the Toom-Cook algorithm uses simple constants and, 
therefore, works well as indicated in (7.28). For N = 7, there are two first degree residue 
polynomials which can each be multiplied by the same techniques used in the N = 5 example. 
Unfortunately, for any longer lengths, the residue polynomials have an order of three or 
greater which causes the Toom-Cook algorithm to require constants of plus and minus two 
and worse. For that reason, the Toom-Cook method is not used, and other techniques such 
as index mapping are used that require more than the minimum number of multiplications, 
but do not require an excessive number of additions. The resulting algorithms still have the 
structure of (7.8). Blahut [29] and Nussbaumer [263] have a good collection of algorithms for 
polynomial multiplication that can be used with the techniques discussed here to construct 
a wide variety of DFT algorithms. 


The constants in the diagonal matrix D can be found from the CRT matrix C in (7.5) using 
d = C^T H' for the diagonal terms in D. As mentioned above, for the smaller prime lengths 
of 3, 5, and 7 this works well but for longer lengths the CRT becomes very complicated. An 
alternate method for finding D uses the fact that since the linear form (7.6) or (7.8) calculates 
the DFT, it is possible to calculate a known DFT of a given x(n) from the definition of the 
DFT in Multidimensional Index Mapping: Equation 1 (3.1) and, given the A matrix in (7.8), 
solve for D by solving a set of simultaneous equations. The details of this procedure are 
described in [197]. 


A modification of this approach also works for a length which is an odd prime raised to 
some power: N = P^M. This is a bit more complicated [235], [416] but has been done for 
lengths of 9 and 25. For longer lengths, the conventional Cooley-Tukey type-two index 
map algorithm seems to be more efficient. For powers of two, there is no primitive root, 
and therefore, no simple conversion of the DFT into convolution. It is possible to use two 
generators [235], [263], [408] to make the conversion and there exists a set of length 4, 8, and 
16 DFT algorithms of the form in (7.8) in [235]. 


In Table 7.1 an operation count of several short DFT algorithms is presented. These are 
practical algorithms that can be used alone or in conjunction with the index mapping to give 
longer DFT’s as shown in The Prime Factor and Winograd Fourier Transform Algorithms 
(Chapter 10). Most are optimized in having either the theoretical minimum number of mul- 
tiplications or the minimum number of multiplications without requiring a very large number 
of additions. Some allow other reasonable trade-offs between numbers of multiplications and 
additions. There are two lists of the number of multiplications. The first is the number of 
actual floating point multiplications that must be done for that length DFT. Some of these 
(one or two in most cases) will be by rational constants and the others will be by irrational 
constants. The second list is the total number of multiplications given in the diagonal matrix 
D in (7.8). At least one of these will be unity (the one associated with X (0)) and in some 
cases several will be unity (for N = 2^M). The second list is important in programming the 
WFTA in The Prime Factor and Winograd Fourier Transform Algorithm: The Winograd 
Fourier Transform Algorithm (Section 10.2: The Winograd Fourier Transform Algorithm). 















































Length N | Mult Non-one | Mult Total | Adds 
       2 |            0 |          4 |    4 
       3 |            4 |          6 |   12 
       4 |            0 |          8 |   16 
       5 |           10 |         12 |   34 
       7 |           16 |         18 |   72 
       8 |            4 |         16 |   52 
       9 |           20 |         22 |   84 
      11 |           40 |         42 |  168 
      13 |           40 |         42 |  188 
      16 |           20 |         36 |  148 
      17 |           70 |         72 |  314 
      19 |           76 |         78 |  372 
      25 |          132 |        134 |  420 
      32 |           68 |          - |  388 




















Table 7.1: Number of Real Multiplications and Additions for a Length-N DFT of Complex 
Data 
Because of the structure of the short DFTs, the number of real multiplications required for 
the DFT of real data is exactly half that required for complex data. The number of real 
additions required is slightly less than half that required for complex data because (N - 1) 
of the additions needed when N is prime add a real to an imaginary, and that is not actually 
performed. When N = 2^m, there are (N - 2) of these pseudo additions. The special case 
for real data is discussed in [101], [177], [356]. 


The structure of these algorithms is of the form X' = CDAX or B^T DAX or A^T DAX 
from (7.5) and (7.8). The A and B matrices are generally M by N with M > N and have 
elements that are integers, generally 0 or ±1. A pictorial description is given in Figure 7.1. 










Figure 7.1: Flow Graph for the Length-5 DFT 











Figure 7.2: Block Diagram of a Winograd Short DFT 
The flow graph in Figure 7.1 should be compared with the matrix description of (7.8) and 
(7.30), and with the programs in [29], [235], [63], [263] and the appendices. The shape in 
Figure 7.2 illustrates the expansion of the data by A. That is to say, AX has more entries 
than X because M > N. The A operator consists of additions, the D operator gives the M 
multiplications (some by one) and A^T contracts the data back to N values with additions 
only. M is one half the second list of multiplies in Table 7.1. 


An important characteristic of the D operator in the calculation of the DFT is that its entries 
are either purely real or purely imaginary. The reduction of the W vector by (s^{(N-1)/2} - 1) and 
(s^{(N-1)/2} + 1) separates the real and the imaginary constants. This is discussed in [416], 
[197]. The number of multiplications for complex data is only twice those necessary for real 
data, not four times. 


Although this discussion has been on the calculation of the DFT, very similar results are 
true for the calculation of convolution and correlation, and these will be further developed in 
Algorithms for Data with Restrictions (Chapter 12). The A^T D A structure and the picture 
in Figure 7.2 are the same for convolution. Algorithms and operation counts can be found 
in [29], [263], [7]. 


7.1 The Bilinear Structure 


The bilinear form introduced in (7.3) and the related linear form in (7.6) are very powerful 
descriptions of both the DFT and convolution. 


$$\text{Bilinear:} \quad Y = C\,[AX \circ BH] \tag{7.31}$$


$$\text{Linear:} \quad Y = C D A X \tag{7.32}$$


Since (7.31) is a bilinear operation defined in terms of a second bilinear operator o, this 
formulation can be nested. For example if o is itself defined in terms of a second bilinear 
operator @, by 


$$X \circ H = C' [A'X \mathbin{@} B'H] \tag{7.33}$$
then (7.31) becomes 


$$Y = C C' [A'AX \mathbin{@} B'BH] \tag{7.34}$$


For convolution, if A represents the polynomial residue reduction modulo the cyclotomic 
polynomials, then A is square (e.g. (7.25)) and o represents multiplication of the residue 
polynomials modulo the cyclotomic polynomials. If A represents the reduction modulo the 
cyclotomic polynomials plus the Toom-Cook reduction as was the case in the example of 
(7.30), then A is M x N and o is term-by-term simple scalar multiplication. In this case AX 
can be thought of as a transform of X and C is the inverse transform. This is called a 
rectangular transform [7] because A is rectangular. The transform requires only additions 
and convolution is done with M multiplications. The other extreme is when A represents 
reduction over the N complex roots of s^N - 1. In this case A is the DFT itself, as in the 
example of (43), and o is point by point complex multiplication and C is the inverse DFT. 
A trivial case is where A, B and C are identity operators and o is the cyclic convolution. 
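
The "other extreme" mentioned above, where A and B are the DFT and C is the inverse DFT, can be sketched directly. This is a deliberately naive O(N^2) illustration of the bilinear form, not a fast algorithm, and the function names are hypothetical.

```python
import cmath


def dft(x, sign=-1):
    """Direct DFT of x; sign=+1 gives the unscaled inverse transform."""
    n = len(x)
    return [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]


def cyclic_conv_via_dft(x, h):
    """Y = C [AX o BH] with A = B = DFT and C = inverse DFT."""
    n = len(x)
    ax = dft(x)                                  # A X
    bh = dft(h)                                  # B H
    prod = [a * b for a, b in zip(ax, bh)]       # point by point complex multiplication
    return [v / n for v in dft(prod, sign=+1)]   # C: inverse DFT with 1/N scaling


def cyclic_conv_direct(x, h):
    """Reference: y(n) = sum over k of x(k) h((n - k) mod N)."""
    n = len(x)
    return [sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)]


y = cyclic_conv_via_dft([1, 2, 3, 4], [1, 0, 2, 1])
print([round(v.real, 9) for v in y])
```

Here the N multiplications in o are complex, whereas the rectangular transform keeps the arithmetic real at the cost of M > N multiplications.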


This very general and flexible bilinear formulation coupled with the idea of nesting in (7.34) 
gives a description of most forms of convolution. 


7.2 Winograd’s Complexity Theorems 


Because Winograd’s work [29], [235], [416], [408], [413], [419] has been the foundation of 
the modern results in efficient convolution and DFT algorithms, it is worthwhile to look at 
his theoretical conclusions on optimal algorithms. Most of his results are stated in terms of 
polynomial multiplication as Polynomial Description of Signals: Equation 3 (4.3) or (7.11). 
The measure of computational complexity is usually the number of multiplications, and only 
certain multiplications are counted. This must be understood in order not to misinterpret 
the results. 


This section will simply give a statement of the pertinent results and will not attempt to 
derive or prove anything. A short interpretation of each theorem will be given to relate 
the result to the algorithms developed in this chapter. The indicated references should be 
consulted for background and detail. 


Theorem 1 [416] Given two polynomials, x(s) and h(s), of degree N and M respectively, 
each with indeterminate coefficients that are elements of a field H, N+M+1 multiplications 
are necessary to compute the coefficients of the product polynomial x (s) h(s). Multiplications 
by elements of the field G (the field of constants), which is contained in H, are not counted 
and G contains at least N + M distinct elements. 


The upper bound in this theorem can be realized by choosing an arbitrary modulus polyno- 
mial P (s) of degree N+M +1 composed of N+M +1 distinct linear polynomial factors with 
coefficients in G which, since its degree is greater than that of the product x (s)h(s), has no effect 
on the product, and by reducing x (s) and h(s) to N+ M +1 residues modulo the N+ M+1 
factors of P(s). These residues are multiplied by each other, requiring N + M + 1 mul- 
tiplications, and the results recombined using the Chinese remainder theorem (CRT). The 
operations required in the reduction and recombination are not counted, while the residue 
multiplications are. Since the modulus P (s) is arbitrary, its factors are chosen to be simple 
so as to make the reduction and CRT simple. Factors of zero, plus and minus unity, and 
infinity are the simplest. Plus and minus two and other factors complicate the actual calcu- 
lations considerably, but the theorem does not take that into account. This algorithm is a 
form of the Toom-Cook algorithm and of Lagrange interpolation [29], [235], [263], [416]. For 
our applications, H is the field of reals and G the field of rationals. 
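
The construction behind Theorem 1 can be sketched as evaluation, pointwise multiplication, and Lagrange interpolation. The following Python sketch uses exact rational arithmetic and the sample points 0, 1, -1, 2, -2, and so on. The choice of points and the function names are assumptions made for illustration; the point at infinity mentioned above is omitted to keep the code short.

```python
from fractions import Fraction


def evaluate(p, t):
    """Evaluate the polynomial p (coefficients lowest power first) at t."""
    return sum(c * t**k for k, c in enumerate(p))


def multiply_by_interpolation(x, h):
    """Product of x(s) and h(s) using N + M + 1 general multiplications."""
    n_points = len(x) + len(h) - 1           # N + M + 1 for degrees N and M
    # Sample points 0, 1, -1, 2, -2, ... (rational constants in G)
    points = [Fraction((k + 1) // 2 * (1 if k % 2 else -1)) for k in range(n_points)]
    # The only counted multiplications: one per sample point
    products = [evaluate(x, t) * evaluate(h, t) for t in points]
    result = [Fraction(0)] * n_points
    for i, ti in enumerate(points):
        basis = [Fraction(1)]                # numerator of the Lagrange basis polynomial
        denom = Fraction(1)
        for j, tj in enumerate(points):
            if j == i:
                continue
            shifted = [Fraction(0)] + basis                  # s * basis
            scaled = [tj * c for c in basis] + [Fraction(0)] # tj * basis
            basis = [a - b for a, b in zip(shifted, scaled)] # basis * (s - tj)
            denom *= ti - tj
        for k, c in enumerate(basis):
            result[k] += products[i] * c / denom
    return result


print(multiply_by_interpolation([1, 2, 3], [4, 5]) == [4, 13, 22, 15])
```

The evaluations and the interpolation involve only multiplications by rational constants, which the theorem does not count. Points such as 2 and -2 are what make those uncounted operations expensive in practice.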
Theorem 2 [416] If an algorithm exists which computes x (s)h(s) in N + M + 1 multiplications, 
all but one of its multiplication steps must necessarily be of the form 


$$m_k = (g_k' + x(g_k))\,(g_k'' + h(g_k)) \quad \text{for } k = 0, 1, \ldots, N + M \tag{7.35}$$


where g_k are distinct elements of G; and g_k' and g_k'' are arbitrary elements of G. 


This theorem states that the structure of an optimal algorithm is essentially unique although 
the factors of P (s) may be chosen arbitrarily. 


Theorem 3 [416] Let P(s) be a polynomial of degree N and be of the form P(s) = Q(s)^k, 
where Q(s) is an irreducible polynomial with coefficients in G and k is a positive integer. 
Let x(s) and h(s) be two polynomials of degree at least N — 1 with coefficients from H, 
then 2N — 1 multiplications are required to compute the product x (s)h(s) modulo P(s). 


This theorem is similar to Theorem 1 (p. 52) with the operations of the reduction of the 
product modulo P (s) not being counted. 


Theorem 4 [416] Any algorithm that computes the product x (s) h (s) modulo P (s) accord- 
ing to the conditions stated in Theorem 3 and requires 2N —1 multiplications will necessarily 
be of one of three structures, each of which has the form of Theorem 2 internally. 


As in Theorem 2 (p. 52), this theorem states that only a limited number of possible structures 
exist for optimal algorithms. 


Theorem 5 [416] If the modulus polynomial P(s) has degree N and is not irreducible, it 
can be written in a unique factored form P(s) = P_1^{m_1}(s) P_2^{m_2}(s) ... P_k^{m_k}(s) where each of the 
P_i(s) is irreducible over the allowed coefficient field G. 2N - k multiplications are necessary 
to compute the product x(s)h(s) modulo P(s) where x(s) and h(s) have coefficients in 
H and are of degree at least N - 1. All algorithms that calculate this product in 2N - k 
multiplications must be of a form where each of the k residue polynomials of x (s) and h (s) 
are separately multiplied modulo the factors of P (s) via the CRT. 


Corollary: If the modulus polynomial is P (s) = s^N - 1, then 2N - t(N) multiplications are 
necessary to compute x (s) h(s) modulo P(s), where t (N) is the number of positive divisors 
of N. 


Theorem 5 (p. 53) is very general since it allows a general modulus polynomial. The proof of 
the upper bound involves reducing x (s) and h(s) modulo the k factors of P(s). Each of the 
k irreducible residue polynomials is then multiplied using the method of Theorem 4 (p. 53) 
requiring 2N_i - 1 multiplies and the products are combined using the CRT. The total number 
of multiplies from the k parts is 2N - k. The theorem also states the structure of these optimal 
algorithms is essentially unique. The special case of P(s) = s^N - 1 is interesting since it 
corresponds to cyclic convolution and, as stated in the corollary, k is easily determined. The 
factors of s^N - 1 are called cyclotomic polynomials and have interesting properties [29], [235], 
[263]. 
Theorem 6 [416], [408] Consider calculating the DFT of a prime length real-valued number 
sequence. If G is chosen as the field of rational numbers, the number of real multiplications 
necessary to calculate a length-P DFT is μ(DFT(P)) = 2P - 3 - t(P - 1) where t(P - 1) 
is the number of divisors of P - 1. 


This theorem not only gives a lower limit on any practical prime length DFT algorithm, 
it also gives practical algorithms for N = 3,5, and 7. Consider the operation counts in 
Table 7.1 to understand this theorem. In addition to the real multiplications counted by 
complexity theory, each optimal prime-length algorithm will have one multiplication by a 
rational constant. That constant corresponds to the residue modulo (s - 1) which always 
exists for the modulus P(s) = s^{P-1} - 1. In a practical algorithm, this multiplication must 
be carried out, and that accounts for the difference in the prediction of Theorem 6 (p. 54) 
and count in Table 7.1. In addition, there is another operation that for certain applications 
must be counted as a multiplication. That is the calculation of the zero frequency term 
X (0) in the first row of the example in The DFT as Convolution or Filtering: Matrix 1 
(5.12). For applications to the WFTA discussed in The Prime Factor and Winograd Fourier 
Transform Algorithms: The Winograd Fourier Transform Algorithm (Section 10.2: The 
Winograd Fourier Transform Algorithm), that operation must be counted as a multiply. 
For lengths longer than 7, optimal algorithms require too many additions, so compromise 
structures are used. 
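
The relation between Theorem 6 and Table 7.1 can be checked with a little arithmetic. The sketch below assumes, as stated earlier in the chapter, that the complex-data count is twice the real-data count, and it adds the one rational multiplication that the theorem does not count. The function names and the dictionary are illustrative only.

```python
def num_divisors(n):
    """t(n): the number of positive divisors of n."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)


def theorem6_real_mults(p):
    """Multiplications counted by Theorem 6 for a prime length-p DFT of real data."""
    return 2 * p - 3 - num_divisors(p - 1)


# "Mult Non-one" column of Table 7.1 (complex data) for the optimal prime lengths
TABLE_NON_ONE = {3: 4, 5: 10, 7: 16}

for p, listed in TABLE_NON_ONE.items():
    counted = theorem6_real_mults(p)
    practical_complex = 2 * (counted + 1)    # add the rational multiply, double for complex data
    print(p, counted, practical_complex, listed)
```

For P = 3, 5 and 7 the adjusted counts reproduce the table entries, which is the agreement the paragraph above describes.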


Theorem 7 [419], [171] If G is chosen as the field of rational numbers, the number of real 
multiplications necessary to calculate a length-N DFT where N is a prime number raised to 
an integer power: N = P^m, is given by 


$$\mu(DFT(N)) = 2N - \frac{m^2 + m}{2}\, t(P - 1) - m - 1 \tag{7.36}$$
where t (P - 1) is the number of divisors of (P - 1). 


This result seems to be practically achievable only for N = 9, or perhaps 25. In the case of 
N =9, there are two rational multiplies that must be carried out and are counted in Table 
7.1 but are not predicted by Theorem 7 (p. 54). Experience [187] indicates that even for 
N = 25, an algorithm based on a Cooley-Tukey FFT using a type 2 index map gives an 
over-all more balanced result. 


Theorem 8 [171] If G is chosen as the field of rational numbers, the number of real 
multiplications necessary to calculate a length-N DFT where N = 2^m is given by 


$$\mu(DFT(N)) = 2N - m^2 - m - 2 \tag{7.37}$$


This result is not practically useful because the number of additions necessary to realize this 
minimum of multiplications becomes very large for lengths greater than 16. Nevertheless, 
it proves the minimum number of multiplications required of an optimal algorithm is a 
linear function of N rather than of N log N which is that required of practical algorithms. 
The best practical power-of-two algorithm seems to be the Split-Radix [105] FFT discussed 
in The Cooley-Tukey Fast Fourier Transform Algorithm: The Split-Radix FFT Algorithm 
(Section 9.2: The Split-Radix FFT Algorithm). 
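
The bound of Theorem 8 can be compared with the power-of-two rows of Table 7.1. The sketch assumes that (7.37) counts multiplications for real data, so the value is doubled before comparing with the complex-data table; that reading is an assumption, although it is consistent with the entries for lengths 2 through 16. The names are illustrative.

```python
def theorem8_real_mults(m):
    """Bound of (7.37) for N = 2**m."""
    n = 2 ** m
    return 2 * n - m * m - m - 2


# "Mult Non-one" column of Table 7.1 (complex data) for the power-of-two lengths
TABLE_NON_ONE = {2: 0, 4: 0, 8: 4, 16: 20, 32: 68}

for m in range(1, 6):
    n = 2 ** m
    print(n, 2 * theorem8_real_mults(m), TABLE_NON_ONE[n])
```

Under that assumption the listed algorithms meet the bound for lengths up to 16, while the length-32 entry (68) exceeds twice the bound (64), in line with the remark that the minimum is not a practical target for longer lengths.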


All of these theorems use ideas based on residue reduction, multiplication of the residues, 
and then combination by the CRT. It is remarkable that this approach finds the minimum 
number of required multiplications by a constructive proof which generates an algorithm 
that achieves this minimum; and the structure of the optimal algorithm is, within certain 
variations, unique. For shorter lengths, the optimal algorithms give practical programs. For 
longer lengths the uncounted operations involved with the multiplication of the higher degree 
residue polynomials become very large and impractical. In those cases, efficient suboptimal 
algorithms can be generated by using the same residue reduction as for the optimal case, but 
by using methods other than the Toom-Cook algorithm of Theorem 1 (p. 52) to multiply 
the residue polynomials. 


Practical long DFT algorithms are produced by combining short prime length optimal DFTs 
with the Type 1 index map from Multidimensional Index Mapping (Chapter 3) to give the 
Prime Factor Algorithm (PFA) and the Winograd Fourier Transform Algorithm (WFTA) 
discussed in The Prime Factor and Winograd Fourier Transform Algorithms (Chapter 10). 
It is interesting to note that the index mapping technique is useful inside the short DFT 
algorithms to replace the Toom-Cook algorithm and outside to combine the short DFT’s to 
calculate long DFT’s. 


7.3 The Automatic Generation of Winograd’s Short DFTs 


by Ivan Selesnick, Polytechnic Institute of New York University 


7.3.1 Introduction 


Efficient prime length DFTs are important for two reasons. A particular application may 
require a prime length DFT and secondly, the maximum length and the variety of lengths 
of a PFA or WFTA algorithm depend upon the availability of prime length modules. 


This section [329], [335], [331], [333] discusses automation of the process Winograd used for 
constructing prime length FFTs [29], [187] for N ≤ 7 and that Johnson and Burrus [197] 
extended to N ≤ 19. It also describes a program that will design any prime length FFT in 
principle, and will also automatically generate the algorithm as a C program and draw the 
corresponding flow graph. 


Winograd’s approach uses Rader’s method to convert a prime length DFT into a length P - 1 
cyclic convolution, polynomial residue reduction to decompose the problem into smaller 
convolutions [29], [263], and the Toom-Cook algorithm [29], [252]. The Chinese Remainder 
Theorem (CRT) for polynomials is then used to recombine the shorter convolutions. 
Unfortunately, the design procedure derived directly from Winograd’s theory becomes cumbersome 
for longer length DFTs, and this has often prevented the design of DFT programs for lengths 
greater than 19. 


Here we use three methods to facilitate the construction of prime length FFT modules. First, 
the matrix exchange property [29], [197], [218] is used so that the transpose of the reduction 
operator can be used rather than the more complicated CRT reconstruction operator. This is 
then combined with the numerical method [197] for obtaining the multiplication coefficients 
rather than the direct use of the CRT. We also deviate from the Toom-Cook algorithm, 
because it requires too many additions for the lengths in which we are interested. Instead we 
use an iterated polynomial multiplication algorithm [29]. We have incorporated these three 
ideas into a single structural procedure that automates the design of prime length FFTs. 


7.3.2 Matrix Description 


It is important that each step in the Winograd FFT can be described using matrices. By 
expressing cyclic convolution as a bilinear form, a compact form of prime length DFTs can 
be obtained. 


If y is the cyclic convolution of h and x, then y can be expressed as 


$$y = C\,[Ax \;.\!*\; Bh] \tag{7.38}$$


where, using the Matlab convention, .* represents point by point multiplication. When A, B, 
and C are allowed to be complex, A and B are seen to be the DFT operator and C, the 
inverse DFT. When only real numbers are allowed, A, B, and C will be rectangular. This 
form of convolution is presented with many examples in [29]. Using the matrix exchange 
property explained in [29] and [197] this form can be written as 


$$y = R B^T [C^T R h \;.\!*\; Ax] \tag{7.39}$$
where R is the permutation matrix that reverses order. 


When h is fixed, as it is when considering prime length DFTs, the term C^T R h can be 
precomputed and a diagonal matrix D formed by D = diag{C^T R h}. This is advantageous because 
in general, C is more complicated than B, so the ability to "hide" C saves computation. 
Now y = R B^T D A x or y = R A^T D A x since A and B can be the same; they implement a 
polynomial reduction. The form y = R A^T D A x can also be used for the prime length DFTs; 
it is only necessary to permute the entries of x and to ensure that the DC term is computed 
correctly. The computation of the DC term is simple, for the residue of a polynomial modulo 
s - 1 is always the sum of the coefficients. After adding the x_0 term of the original input 
sequence to the s - 1 residue, the DC term is obtained. Now DFT{x} = R A^T D A x. In [197] 
Johnson observes that by permuting the elements on the diagonal of D, the output can be 
permuted, so that the R matrix can be hidden in D, and DFT{x} = A^T D A x. From the 
knowledge of this form, once A is found, D can be found numerically [197]. 
7.3.3 Programming the Design Procedure 


Because each of the above steps can be described by matrices, the development of 
prime length FFTs is made convenient with the use of a matrix oriented programming language 
such as Matlab. After specifying the appropriate matrices that describe the desired FFT 
algorithm, generating code involves compiling the matrices into the desired code for execu- 
tion. 


Each matrix is a section of one stage of the flow graph that corresponds to the DFT program. 
The four stages are: 


1. Permutation Stage: Permutes input and output sequence. 

2. Reduction Stage: Reduces the cyclic convolution to smaller polynomial products. 

3. Polynomial Product Stage: Performs the polynomial multiplications. 

4. Multiplication Stage: Implements the point-by-point multiplication in the bilinear 
form. 


Each of the stages can be clearly seen in the flow graphs for the DFTs. Figure 7.3 shows the 
flow graph for a length 17 DFT algorithm that was automatically drawn by the program. 



















































































Figure 7.3: Flowgraph of length-17 DFT 





The programs that accomplish this process are written in Matlab and C. Those that compute 
the appropriate matrices are written in Matlab. These matrices are then stored as two ASCII 
files, with the dimensions in one and the matrix elements in the second. A C program then 
reads the files and compiles them to produce the final FFT program in C [335]. 


7.3.4 The Reduction Stage 


The reduction of an N^th degree polynomial, X (s), modulo the cyclotomic polynomial factors 
of (s^N - 1) requires only additions for many N; however, the actual number of additions 
depends upon the way in which the reduction proceeds. The reduction is most efficiently 
performed in steps. For example, if N = 4 and ((X(s)))_{s-1}, ((X(s)))_{s+1}, and ((X(s)))_{s^2+1} 
are required, where the double parentheses denote polynomial reduction modulo (s - 1), (s + 1), 
and (s^2 + 1), then in the first step ((X(s)))_{s^2-1} and ((X(s)))_{s^2+1} should be computed. 
In the second step, ((X(s)))_{s-1} and ((X(s)))_{s+1} can be found by reducing ((X(s)))_{s^2-1}. 
This process is described by the diagram in Figure 7.4. 





Figure 7.4: Factorization of s^4 - 1 in steps 





When N is even, the appropriate first factorization is (s^{N/2} - 1)(s^{N/2} + 1); however, the 
next appropriate factorization is frequently less obvious. The following procedure has been 
found to generate a factorization in steps that coincides with the factorization that minimizes 
the cumulative number of additions incurred by the steps. The prime factors of N are the 
basis of this procedure and their importance is clear from the useful well-known equation 
s^N - 1 = ∏_{n|N} C_n(s) where C_n(s) is the n^th cyclotomic polynomial. 
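
The product formula can be turned into a short recursive computation of the cyclotomic polynomials: divide s^n - 1 by C_d(s) for every proper divisor d of n. This Python sketch is illustrative only; the function names are hypothetical and the division routine assumes a monic divisor, which holds for cyclotomic polynomials.

```python
from functools import lru_cache


def poly_divide(num, den):
    """Exact division of integer polynomials (coefficients lowest power first).

    Assumes den is monic and divides num without remainder.
    """
    num = list(num)
    quotient = [0] * (len(num) - len(den) + 1)
    for k in range(len(quotient) - 1, -1, -1):
        quotient[k] = num[k + len(den) - 1] // den[-1]
        for j, d in enumerate(den):
            num[k + j] -= quotient[k] * d
    assert not any(num), "division left a remainder"
    return quotient


@lru_cache(maxsize=None)
def cyclotomic(n):
    """Coefficients of C_n(s), obtained from s^n - 1 = product of C_d(s) over d | n."""
    poly = [-1] + [0] * (n - 1) + [1]        # s^n - 1
    for d in range(1, n):
        if n % d == 0:
            poly = poly_divide(poly, cyclotomic(d))
    return tuple(poly)


print(cyclotomic(4))    # (1, 0, 1), that is s^2 + 1
print(cyclotomic(12))   # (1, 0, -1, 0, 1), that is s^4 - s^2 + 1
```

Running this for small n shows the property used throughout the chapter: the coefficients are 0, 1 or -1, so reduction modulo these factors needs only additions.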


We first introduce the following two functions defined on the positive integers, 
$$\psi(N) = \text{the smallest prime factor of } N \text{ for } N > 1 \tag{7.40}$$
and ψ(1) = 1. 


Suppose P(s) is equal to either (s^N - 1) or an intermediate noncyclotomic polynomial 
appearing in the factorization process, for example, (s^2 - 1), above. Write P (s) in terms of 
its cyclotomic factors, 


$$P(s) = C_{k_1}(s)\, C_{k_2}(s) \cdots C_{k_L}(s) \tag{7.41}$$




define the two sets, $G$ and $G'$, by 


$G = \{k_1, \cdots, k_L\}$ and $G' = \{k / \gcd(G) : k \in G\}$ (7.42) 
and define the two integers, $t$ and $T$, by 


$t = \min\{\psi(k) : k \in G', k > 1\}$ and $T = \max\{\nu(k, t) : k \in G'\}$ (7.43) 


Then form two new sets, 


$A = \{k \in G : T \nmid k\}$ and $B = \{k \in G : T \mid k\}$ (7.44) 
The factorization of $P(s)$, 


$P(s) = \left( \prod_{k \in A} C_k(s) \right) \left( \prod_{k \in B} C_k(s) \right)$ (7.45) 


has been found useful in the procedure for factoring $(s^N - 1)$. This is best illustrated with 
an example. 


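Example: N = 36 
Step 1. Let $P(s) = s^{36} - 1$. Since $P = C_1 C_2 C_3 C_4 C_6 C_9 C_{12} C_{18} C_{36}$, 


$G = G' = \{1, 2, 3, 4, 6, 9, 12, 18, 36\}$ (7.46) 
$t = \min\{2, 3\} = 2$ (7.47) 
Here $T = 4$, the largest power of $t = 2$ that divides an element of $G'$, so 
$A = \{k \in G : 4 \nmid k\} = \{1, 2, 3, 6, 9, 18\}$ (7.48) 
$B = \{k \in G : 4 \mid k\} = \{4, 12, 36\}$ (7.49) 
Hence the factorization of $s^{36} - 1$ into two intermediate polynomials is as expected, 
$\prod_{k \in A} C_k(s) = s^{18} - 1$, $\prod_{k \in B} C_k(s) = s^{18} + 1$ (7.50) 


If a 36th degree polynomial, $X(s)$, is represented by a vector of coefficients, $X = 
(x_{35}, \cdots, x_0)'$, then $((X(s)))_{s^{18}-1}$ (represented by $X'$) and $((X(s)))_{s^{18}+1}$ (represented by $X''$) 
are given by 


$\begin{bmatrix} X' \\ X'' \end{bmatrix} = \begin{bmatrix} I_{18} & I_{18} \\ -I_{18} & I_{18} \end{bmatrix} X$ (7.51) 
which entails 36 additions. (The matrix follows from $s^{18} \equiv 1$ and $s^{18} \equiv -1$, respectively.) 
Step 2. This procedure is repeated with $P(s) = s^{18} - 1$ and $P(s) = s^{18} + 1$. We will just 
show it for the latter. Let $P(s) = s^{18} + 1$. Since $P = C_4 C_{12} C_{36}$, 


$G = \{4, 12, 36\}$, $G' = \{1, 3, 9\}$ (7.52) 

$t = \min\{3\} = 3$ (7.53) 

$T = \max\{\nu(k, 3) : k \in G'\} = \max\{1, 3, 9\} = 9$ (7.54) 
$A = \{k \in G : 9 \nmid k\} = \{4, 12\}$ (7.55) 
$B = \{k \in G : 9 \mid k\} = \{36\}$ (7.56) 


This yields the two intermediate polynomials, 


$s^6 + 1$, and $s^{12} - s^6 + 1$ (7.57) 
In the notation used above, 
$\begin{bmatrix} X' \\ X'' \end{bmatrix} = \begin{bmatrix} I_6 & -I_6 & I_6 \\ I_6 & I_6 & \\ -I_6 & & I_6 \end{bmatrix} X$ (7.58) 


entailing 24 additions. Continuing this process results in a factorization in steps. 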
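

The splitting rule (7.42) to (7.44) can be sketched as follows. This is a hedged illustration, not the program used by the authors: the function names are hypothetical, and $\nu(k, t)$ is assumed to mean the largest power of $t$ dividing $k$, which is what the $N = 36$ example suggests. Divisibility by $T$ is tested on the elements of $G$, as written in (7.44).

```python
from functools import reduce
from math import gcd


def smallest_prime_factor(n):
    """psi(n): smallest prime factor of n, with psi(1) = 1."""
    if n == 1:
        return 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n


def largest_power_dividing(k, t):
    """Assumed meaning of nu(k, t): the largest power of t that divides k."""
    power = 1
    while k % (power * t) == 0:
        power *= t
    return power


def split(G):
    """Return the index sets A and B of (7.44) for the cyclotomic indices G."""
    g = reduce(gcd, G)
    G_prime = [k // g for k in G]
    t = min(smallest_prime_factor(k) for k in G_prime if k > 1)
    T = max(largest_power_dividing(k, t) for k in G_prime)
    A = [k for k in G if k % T != 0]
    B = [k for k in G if k % T == 0]
    return A, B


divisors_of_36 = [k for k in range(1, 37) if 36 % k == 0]
print(split(divisors_of_36))   # Step 1 of the example
print(split([4, 12, 36]))      # Step 2 of the example
```

Applying `split` repeatedly to each resulting set with more than one element would generate the whole factorization in steps.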


In order to see the number of additions this scheme uses for numbers of the form $N = P - 1$ 
(which is relevant to prime length FFT algorithms) figure 4 shows the number of additions 
the reduction process uses when the polynomial $X(s)$ is real. 


Figure 4: Number of Additions for Reduction Stage 


7.3.5 The Polynomial Product Stage 


The iterated convolution algorithm can be used to construct an $N$ point linear convolution 
algorithm from shorter linear convolution algorithms [29]. Suppose the linear convolution $y$, 
of the $n$ point vectors $x$ and $h$ ($h$ known) is described by 


$y = E_n^T D E_n x$ (7.59) 


where $E_n$ is an "expansion" matrix the elements of which are $\pm 1$'s and 0's and $D$ is an 
appropriate diagonal matrix. Because the only multiplications in this expression are by the 
elements of $D$, the number of multiplications required, $M(n)$, is equal to the number of rows 
of $E_n$. The number of additions is denoted by $A(n)$. 
Given a matrix $E_{n_1}$ and a matrix $E_{n_2}$, the iterated algorithm gives a method for combining 
$E_{n_1}$ and $E_{n_2}$ to construct a valid expansion matrix, $E_N$, for $N \le n_1 n_2$. Specifically, 


$E_{n_1, n_2} = \left( I_{M(n_2)} \otimes E_{n_1} \right) \left( E_{n_2} \otimes I_{n_1} \right)$ (7.60) 


The product $n_1 n_2$ may be greater than $N$, for zeros can be (conceptually) appended to $x$. 
The operation count associated with $E_{n_1, n_2}$ is 


$A(n_1, n_2) = n_1 A(n_2) + A(n_1) M(n_2)$ (7.61) 


$M(n_1, n_2) = M(n_1) M(n_2)$ (7.62) 


Although they are both valid expansion matrices, $E_{n_1, n_2} \ne E_{n_2, n_1}$ and $A_{n_1, n_2} \ne A_{n_2, n_1}$. 
Because $M_{n_1, n_2} = M_{n_2, n_1}$, it is desirable to choose an ordering of factors to minimize the 
additions incurred by the expansion matrix. The following [7], [263] follows from above. 


7.3.5.1 Multiple Factors 


Note that a valid expansion matrix, $E_N$, can be constructed from $E_{n_1, n_2}$ and $E_{n_3}$, for $N \le 
n_1 n_2 n_3$. In general, any number of factors can be used to create larger expansion matrices. 
The operation count associated with $E_{n_1, n_2, n_3}$ is 


$A(n_1, n_2, n_3) = n_1 n_2 A(n_3) + n_1 A(n_2) M(n_3) + A(n_1) M(n_2) M(n_3)$ (7.63) 


$M(n_1, n_2, n_3) = M(n_1) M(n_2) M(n_3)$ (7.64) 


These equations generalize in the predicted way when more factors are considered. Because 
the ordering of the factors is relevant in the equation for A(.) but not for M (.), it is again 
desirable to order the factors to minimize the number of additions. By exploiting the fol- 
lowing property of the expressions for A(.) and M (.), the optimal ordering can be found 


[“1. 


Preservation of Optimal Ordering. Suppose $A(n_1, n_2, n_3) \le \min\{A(n_{k_1}, n_{k_2}, n_{k_3}) : k_1, k_2, k_3 \in 
\{1, 2, 3\}$ and distinct$\}$, then 


1. $A(n_1, n_2) \le A(n_2, n_1)$ (7.65) 
2. $A(n_2, n_3) \le A(n_3, n_2)$ (7.66) 
3. $A(n_1, n_3) \le A(n_3, n_1)$ (7.67) 


The generalization of this property to more than two factors reveals that an optimal ordering 
of $\{n_1, \cdots, n_{L-1}\}$ is preserved in an optimal ordering of $\{n_1, \cdots, n_L\}$. Therefore, if $(n_1, \cdots, n_L)$ 
is an optimal ordering of $\{n_1, \cdots, n_L\}$, then $(n_k, n_{k+1})$ is an optimal ordering of $\{n_k, n_{k+1}\}$ 
and consequently 


$\frac{A(n_k)}{M(n_k) - n_k} \le \frac{A(n_{k+1})}{M(n_{k+1}) - n_{k+1}}$ (7.68) 
for all $k = 1, 2, \cdots, L - 1$. 
This immediately suggests that an optimal ordering of $\{n_1, \cdots, n_L\}$ is one for which 


$\frac{A(n_1)}{M(n_1) - n_1}, \; \cdots, \; \frac{A(n_L)}{M(n_L) - n_L}$ (7.69) 


is nondecreasing. Hence, ordering the factors, $\{n_1, \cdots, n_L\}$, to minimize the number of 
additions incurred by $E_{n_1, \cdots, n_L}$ simply involves computing the appropriate ratios. 
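

The ratio rule can be checked against a brute-force search. The sketch below is an illustration only: the table `COUNTS` holds hypothetical $(M(n), A(n))$ values chosen for the example rather than counts taken from this chapter, and it assumes $M(n) > n$ so that the ratios in (7.69) are positive. The function `additions` applies the recursion (7.61) factor by factor, which reproduces (7.63) for three factors.

```python
from itertools import permutations

# Hypothetical (M(n), A(n)) pairs for short linear convolution modules.
# These numbers are illustrative assumptions, not values from the text.
COUNTS = {2: (3, 3), 3: (5, 20), 4: (7, 50)}


def additions(order):
    """A(n1, ..., nL), built up by repeated use of (7.61)."""
    total, size = 0, 1
    for n in order:
        mults, adds = COUNTS[n]
        total = size * adds + total * mults
        size *= n
    return total


def ratio(n):
    """The quantity A(n) / (M(n) - n) that appears in (7.69)."""
    mults, adds = COUNTS[n]
    return adds / (mults - n)


ratio_order = tuple(sorted(COUNTS, key=ratio))
best_order = min(permutations(COUNTS), key=additions)
print(ratio_order, best_order, additions(best_order))
```

For these sample counts the ordering by nondecreasing ratio coincides with the ordering found by exhaustive search.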


7.3.6 Discussion and Conclusion 


We have designed prime length FFTs up to length 53 that are as good as the previous designs 
that only went up to 19. Table 7.2 gives the operation counts for the new and previously 
designed modules, assuming complex inputs. 


It is interesting to note that the operation counts depend on the factorability of $P - 1$. The 
primes 11, 23, and 47 are all of the form $1 + 2P'$ with $P'$ itself prime (5, 11, and 23, 
respectively), so $P - 1$ has few factors, making the design of efficient FFTs for 
these lengths more difficult. 


Further deviations from the original Winograd approach than we have made could prove use- 
ful for longer lengths. We investigated, for example, the use of twiddle factors at appropriate 
points in the decomposition stage; these can sometimes be used to divide the cyclic convolu- 
tion into smaller convolutions. Their use means, however, that the 'center' multiplications 
would no longer be by purely real or imaginary numbers. 


N  | Mult | Adds 
7  | 16   | 72 
11 | 40   | 168 
13 | 40   | 188 
17 | 82   | 274 
19 | 88   | 360 
23 | 174  | 672 
29 | 190  | 766 
31 | 160  | 984 
37 | 220  | 920 
41 | 282  | 1140 
43 | 304  | 1416 
47 | 640  | 2088 
53 | 556  | 2038 


Table 7.2: Operation counts for prime length DFTs 


The approach in writing a program that writes another program is a valuable one for several 
reasons. Programming the design process for the design of prime length FFTs has the 
advantages of being practical, error-free, and flexible. The flexibility is important because it 
allows for modification and experimentation with different algorithmic ideas. Above all, it 
has allowed longer DFTs to be reliably designed. 


More details on the generation of programs for prime length FFTs can be found in the 1993 
Technical Report. 


Chapter 8 


DFT and FFT: An Algebraic View¹ 


by Markus Pueschel, Carnegie Mellon University 


In infinite, or non-periodic, discrete-time signal processing, there is a strong connection 
between the z-transform, Laurent series, convolution, and the discrete-time Fourier transform 
(DTFT) [277]. As one may expect, a similar connection exists for the DFT but bears 
surprises. Namely, it turns out that the proper framework for the DFT requires modulo 
operations of polynomials, which means working with so-called polynomial algebras [138]. 
Associated with polynomial algebras is the Chinese remainder theorem, which describes the 
DFT algebraically and can be used as a tool to concisely derive various FFTs as well as 
convolution algorithms [268], [409], [414], [12] (see also Winograd’s Short DFT Algorithms 
(Chapter 7)). The polynomial algebra framework was fully developed for signal processing as 
part of the algebraic signal processing theory (ASP). ASP identifies the structure underlying 
many transforms used in signal processing, provides deep insight into their properties, and 
enables the derivation of their fast algorithms [295], [293], [291], [294]. Here we focus on the 
algebraic description of the DFT and on the algebraic derivation of the general-radix Cooley- 
Tukey FFT from Factoring the Signal Processing Operators (Chapter 6). The derivation will 
make use of and extend the Polynomial Description of Signals (Chapter 4). We start with 
motivating the appearance of modulo operations. 


The z-transform associates with infinite discrete signals $X = (\cdots, x(-1), x(0), x(1), \cdots)$ 
a Laurent series: 


$X \mapsto X(s) = \sum_{n \in \mathbb{Z}} x(n) s^n.$ (8.1) 


Here we used $s = z^{-1}$ to simplify the notation in the following. The DTFT of $X$ is the 
evaluation of $X(s)$ on the unit circle 


$X(e^{-j\omega}), \quad -\pi < \omega \le \pi.$ (8.2) 





¹This content is available online at <http://cnx.org/content/m16331/1.14/>. 
Finally, filtering or (linear) convolution is simply the multiplication of Laurent series, 


$H * X \leftrightarrow H(s) X(s).$ (8.3) 


For finite signals $X = (x(0), \cdots, x(N - 1))$ one expects that the equivalent of (8.1) becomes 
a mapping to polynomials of degree $N - 1$, 


$X \mapsto X(s) = \sum_{n=0}^{N-1} x(n) s^n,$ (8.4) 


and that the DFT is an evaluation of these polynomials. Indeed, the definition of the DFT 
in Winograd's Short DFT Algorithms (Chapter 7) shows that 


$C(k) = X(W_N^k) = X(e^{-j 2\pi k / N}), \quad 0 \le k < N,$ (8.5) 
i.e., the DFT computes the evaluations of the polynomial $X(s)$ at the $N$th roots of unity. 


The problem arises with the equivalent of (8.3), since the multiplication H (s) X (s) of two 
polynomials of degree N — 1 yields one of degree 2N — 2. Also, it does not coincide with the 
circular convolution known to be associated with the DFT. The solution to both problems 
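is to reduce the product modulo $s^N - 1$: 


$H *_{\text{circ}} X \leftrightarrow H(s) X(s) \bmod (s^N - 1).$ (8.6) 


Concept | Infinite Time | Finite Time 
Signal | $X(s) = \sum_{n \in \mathbb{Z}} x(n) s^n$ | $\sum_{n=0}^{N-1} x(n) s^n$ 
Filter | $H(s) = \sum_{n \in \mathbb{Z}} h(n) s^n$ | $\sum_{n=0}^{N-1} h(n) s^n$ 
Convolution | $H(s) X(s)$ | $H(s) X(s) \bmod (s^N - 1)$ 
Fourier transform | DTFT: $X(e^{-j\omega})$, $-\pi < \omega \le \pi$ | DFT: $X(e^{-j 2\pi k / N})$, $0 \le k < N$ 


Table 8.1: Infinite and finite discrete time signal processing. 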


The resulting polynomial then has again degree $N - 1$ and this form of convolution becomes 
equivalent to circular convolution of the polynomial coefficients. We also observe that the 
evaluation points in (8.5) are precisely the roots of $s^N - 1$. This connection will become clear 
in this chapter. 
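

Both observations can be tried numerically. The following sketch is only an illustration with hypothetical function names and sample data: it multiplies two coefficient lists modulo $s^N - 1$ by folding exponents, compares the result with circular convolution computed from its definition, and evaluates the polynomials at the $N$th roots of unity as in (8.5) using Horner's rule.

```python
import cmath


def polymul_mod(h, x):
    """Coefficients of H(s) X(s) mod (s^N - 1), lowest degree first."""
    N = len(x)
    out = [0] * N
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            out[(i + j) % N] += hi * xj  # s^N is replaced by 1
    return out


def circular_convolution(h, x):
    """Circular convolution computed directly from its definition."""
    N = len(x)
    return [sum(h[m] * x[(n - m) % N] for m in range(N)) for n in range(N)]


def dft_by_evaluation(x):
    """Evaluate X(s) at s = exp(-2*pi*j*k/N), k = 0..N-1, by Horner's rule."""
    N = len(x)
    values = []
    for k in range(N):
        w = cmath.exp(-2j * cmath.pi * k / N)
        acc = 0
        for c in reversed(x):
            acc = acc * w + c
        values.append(acc)
    return values


h, x = [1, 2, 3, 4], [5, 6, 7, 8]
y = polymul_mod(h, x)
print(y, circular_convolution(h, x))
```

Because the evaluation points are roots of $s^N - 1$, evaluating the reduced product gives the same values as multiplying the evaluations of $H(s)$ and $X(s)$ pointwise.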


The discussion is summarized in Table 8.1. 


The proper framework to describe the multiplication of polynomials modulo a fixed polyno- 
mial is that of polynomial algebras. Together with the Chinese remainder theorem, they provide 
the theoretical underpinning for the DFT and the Cooley-Tukey FFT. 
In this chapter, the DFT will naturally arise as a linear mapping with respect to chosen bases, 
i.e., as a matrix. Indeed, the definition shows that if all inputs and outputs are collected into 
vectors $X = (X(0), \cdots, X(N - 1))$ and $C = (C(0), \cdots, C(N - 1))$, then the DFT definition in Winograd's Short 
DFT Algorithms (Chapter 7) is equivalent to 


$C = \text{DFT}_N X,$ (8.7) 


where 


$\text{DFT}_N = [W_N^{kn}]_{0 \le k, n < N}.$ (8.8) 


The matrix point of view is adopted in the FFT books [388], [381]. 


8.1 Polynomial Algebras and the DFT 


In this section we introduce polynomial algebras and explain how they are associated to 
transforms. Then we identify this connection for the DFT. Later we use polynomial algebras 
to derive the Cooley-Tukey FFT. 


For further background on the mathematics in this section and polynomial algebras in par- 
ticular, we refer to [138]. 


8.1.1 Polynomial Algebra 


An algebra $\mathcal{A}$ is a vector space that also provides a multiplication of its elements such that 
the distributivity law holds (see [138] for a complete definition). Examples include the sets of 
complex or real numbers $\mathbb{C}$ or $\mathbb{R}$, and the sets of complex or real polynomials in the variable 
$s$: $\mathbb{C}[s]$ or $\mathbb{R}[s]$. 


The key player in this chapter is the polynomial algebra. Given a fixed polynomial $P(s)$ 
of degree $\deg(P) = N$, we define a polynomial algebra as the set 


$\mathbb{C}[s] / P(s) = \{X(s) \mid \deg(X) < \deg(P)\}$ (8.9) 


of polynomials of degree smaller than $N$ with addition and multiplication modulo $P$. Viewed 
as a vector space, $\mathbb{C}[s] / P(s)$ hence has dimension $N$. 


Every polynomial $X(s) \in \mathbb{C}[s]$ is reduced to a unique polynomial $R(s)$ modulo $P(s)$ of 
degree smaller than $N$. $R(s)$ is computed using division with remainder, namely 


$X(s) = Q(s) P(s) + R(s), \quad \deg(R) < \deg(P).$ (8.10) 


Regarding this equation modulo $P$, $P(s)$ becomes zero, and we get 


$X(s) \equiv R(s) \bmod P(s).$ (8.11) 
We read this equation as "$X(s)$ is congruent (or equal) to $R(s)$ modulo $P(s)$." We will also 
write $X(s) \bmod P(s)$ to denote that $X(s)$ is reduced modulo $P(s)$. Obviously, 


$P(s) \equiv 0 \bmod P(s).$ (8.12) 


As a simple example we consider $\mathcal{A} = \mathbb{C}[s] / (s^2 - 1)$, which has dimension 2. A possible 
basis is $b = (1, s)$. In $\mathcal{A}$, for example, $s \cdot (s + 1) = s^2 + s \equiv s + 1 \bmod (s^2 - 1)$, obtained 
through division with remainder 


$s^2 + s = 1 \cdot (s^2 - 1) + (s + 1)$ (8.13) 


or simply by replacing $s^2$ with 1 (since $s^2 - 1 = 0$ implies $s^2 = 1$). 
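

Division with remainder as in (8.10) is easy to sketch for coefficient lists. The helper `poly_mod` below is a hypothetical illustration, not part of any library; it stores coefficients lowest degree first and assumes that the leading coefficient of $P$ is nonzero.

```python
def poly_mod(x, p):
    """Remainder R(s) of X(s) divided by P(s), coefficients lowest degree first.

    Assumes p[-1], the leading coefficient of P, is nonzero.
    """
    r = [float(c) for c in x]
    n = len(p) - 1  # deg(P)
    for d in range(len(r) - 1, n - 1, -1):
        q = r[d] / p[n]                  # next coefficient of the quotient Q(s)
        for i in range(n + 1):
            r[d - n + i] -= q * p[i]     # subtract q * s^(d-n) * P(s)
    return r[:n]


# s * (s + 1) = s^2 + s reduced modulo s^2 - 1, as in (8.13)
print(poly_mod([0, 1, 1], [-1, 0, 1]))
```

The returned list holds the coordinates of the reduced polynomial in the basis $(1, s, \cdots, s^{N-1})$.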


8.1.2 Chinese Remainder Theorem (CRT) 


Assume $P(s) = Q(s) R(s)$ factors into two coprime (no common factors) polynomials $Q$ 
and $R$. Then the Chinese remainder theorem (CRT) for polynomials is the linear mapping² 


$\Delta : \mathbb{C}[s] / P(s) \to \mathbb{C}[s] / Q(s) \oplus \mathbb{C}[s] / R(s),$ 
$X(s) \mapsto (X(s) \bmod Q(s), \; X(s) \bmod R(s)).$ (8.14) 


Here, $\oplus$ is the Cartesian product of vector spaces with elementwise operation (also called 
outer direct sum). In words, the CRT asserts that computing (addition, multiplication, 
scalar multiplication) in $\mathbb{C}[s] / P(s)$ is equivalent to computing in parallel in $\mathbb{C}[s] / Q(s)$ and 
$\mathbb{C}[s] / R(s)$. 


If we choose bases $b, c, d$ in the three polynomial algebras, then $\Delta$ can be expressed as a 
matrix. As usual with linear mappings, this matrix is obtained by mapping every element 
of $b$ with $\Delta$, expressing it in the concatenation $c \cup d$ of the bases $c$ and $d$, and writing the 
results into the columns of the matrix. 


As an example, we consider again the polynomial $P(s) = s^2 - 1 = (s - 1)(s + 1)$ and the 
CRT decomposition 


$\Delta : \mathbb{C}[s] / (s^2 - 1) \to \mathbb{C}[s] / (s - 1) \oplus \mathbb{C}[s] / (s + 1).$ (8.15) 


As bases, we choose $b = (1, s)$, $c = (1)$, $d = (1)$. $\Delta(1) = (1, 1)$ with the same coordinate 
vector in $c \cup d = (1, 1)$. Further, because of $s \equiv 1 \bmod (s - 1)$ and $s \equiv -1 \bmod (s + 1)$, 
$\Delta(s) = (s, s) \equiv (1, -1)$ with the same coordinate vector. Thus, $\Delta$ in matrix form is the 
so-called butterfly matrix, which is a DFT of size 2: $\text{DFT}_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$. 


²More precisely, isomorphism of algebras or isomorphism of $\mathcal{A}$-modules. 


8.1.3 Polynomial Transforms 


Assume $P(s) \in \mathbb{C}[s]$ has pairwise distinct zeros $\alpha = (\alpha_0, \cdots, \alpha_{N-1})$. Then the CRT can be 
used to completely decompose $\mathbb{C}[s] / P(s)$ into its spectrum: 

$\Delta : \mathbb{C}[s] / P(s) \to \mathbb{C}[s] / (s - \alpha_0) \oplus \cdots \oplus \mathbb{C}[s] / (s - \alpha_{N-1}),$ 
$X(s) \mapsto (X(s) \bmod (s - \alpha_0), \cdots, X(s) \bmod (s - \alpha_{N-1}))$ 
$= (X(\alpha_0), \cdots, X(\alpha_{N-1})).$ (8.16) 

If we choose a basis $b = (P_0(s), \cdots, P_{N-1}(s))$ in $\mathbb{C}[s] / P(s)$ and bases $b_i = (1)$ in each 
$\mathbb{C}[s] / (s - \alpha_i)$, then $\Delta$, as a linear mapping, is represented by a matrix. The matrix is 
obtained by mapping every basis element $P_n$, $0 \le n < N$, and collecting the results in the 
columns of the matrix. The result is 


$\mathcal{P}_{b,\alpha} = [P_n(\alpha_k)]_{0 \le k, n < N}$ (8.17) 
and is called the polynomial transform for $\mathcal{A} = \mathbb{C}[s] / P(s)$ with basis $b$. 


If, in general, we choose $b_i = (\beta_i)$ as spectral basis, then the matrix corresponding to the 
decomposition (8.16) is the scaled polynomial transform 


$\text{diag}_{0 \le n < N}(1 / \beta_n) \, \mathcal{P}_{b,\alpha},$ (8.18) 
where $\text{diag}_{0 \le n < N}(\gamma_n)$ denotes a diagonal matrix with diagonal entries $\gamma_n$. 


We jointly refer to polynomial transforms, scaled or not, as Fourier transforms. 


8.1.4 DFT as a Polynomial Transform 
We show that the DFT is a polynomial transform for $\mathcal{A} = \mathbb{C}[s] / (s^N - 1)$ with basis 
$b = (1, s, \cdots, s^{N-1})$. Namely, 


$s^N - 1 = \prod_{0 \le k < N} (s - W_N^k),$ (8.19) 


which means that $\Delta$ takes the form 
$\Delta : \mathbb{C}[s] / (s^N - 1) \to \mathbb{C}[s] / (s - W_N^0) \oplus \cdots \oplus \mathbb{C}[s] / (s - W_N^{N-1}),$ 
$X(s) \mapsto (X(s) \bmod (s - W_N^0), \cdots, X(s) \bmod (s - W_N^{N-1}))$ 
$= (X(W_N^0), \cdots, X(W_N^{N-1})).$ (8.20) 


The associated polynomial transform hence becomes 


$\mathcal{P}_{b,\alpha} = [W_N^{kn}]_{0 \le k, n < N} = \text{DFT}_N.$ (8.21) 


This interpretation of the DFT has been known at least since [409], [268] and clarifies the 
connection between the evaluation points in (8.5) and the circular convolution in (8.6). 


In [40], DFTs of types 1-4 are defined, with type 1 being the standard DFT. In the algebraic 
framework, type 3 is obtained by choosing $\mathcal{A} = \mathbb{C}[s] / (s^N + 1)$ as algebra with the same 
basis as before: 


$\mathcal{P}_{b,\alpha} = [W_N^{(k + 1/2) n}]_{0 \le k, n < N} = \text{DFT-3}_N.$ (8.22) 


The DFTs of type 2 and 4 are scaled polynomial transforms [295]. 
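

The polynomial transform for the monomial basis can be written down directly from its definition. The sketch below is a hedged illustration with hypothetical helper names: it builds the matrix $[\alpha_k^n]$ for a given list of zeros and uses the zeros of $s^N - 1$ and of $s^N + 1$ to obtain the DFT and the DFT of type 3, respectively.

```python
import cmath


def polynomial_transform(zeros, N):
    """Matrix [alpha_k ** n] for the basis b = (1, s, ..., s^(N-1))."""
    return [[a ** n for n in range(N)] for a in zeros]


def root(N, e):
    """W_N^e = exp(-2*pi*j*e/N); e may be fractional."""
    return cmath.exp(-2j * cmath.pi * e / N)


N = 4
dft = polynomial_transform([root(N, k) for k in range(N)], N)         # zeros of s^N - 1
dft3 = polynomial_transform([root(N, k + 0.5) for k in range(N)], N)  # zeros of s^N + 1
butterfly = polynomial_transform([1, -1], 2)                          # zeros of s^2 - 1

print(butterfly)
print([[complex(round(v.real), round(v.imag)) for v in row] for row in dft])
```

The size-2 case reproduces the butterfly matrix obtained from the CRT example for $s^2 - 1$.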


8.2 Algebraic Derivation of the Cooley-Tukey FFT 


Knowing the polynomial algebra underlying the DFT enables us to derive the Cooley-Tukey 
FFT algebraically. This means that instead of manipulating the DFT definition, we 
manipulate the polynomial algebra $\mathbb{C}[s] / (s^N - 1)$. The basic idea is intuitive. We showed that 
the DFT is the matrix representation of the complete decomposition (8.20). The Cooley- 
Tukey FFT is now derived by performing this decomposition in steps as shown in Figure 8.1. 
Each step yields a sparse matrix; hence, the $\text{DFT}_N$ is factorized into a product of sparse 
matrices, which will be the matrix representation of the Cooley-Tukey FFT. 





[Diagram with labels: $\mathbb{C}[s] / P(s)$; partial decomposition; Fourier transform; $\bigoplus_{0 \le k < N} \mathbb{C}[s] / (s - \alpha_k)$] 


Figure 8.1: Basic idea behind the algebraic derivation of Cooley-Tukey type algorithms 





This stepwise decomposition can be formulated generically for polynomial transforms [292], 
[294]. Here, we consider only the DFT. 


We first introduce the matrix notation we will use and in particular the Kronecker product 
formalism that became mainstream for FFTs in [388], [381]. 


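Then we first derive the radix-2 FFT using a factorization of $s^N - 1$. Subsequently, we 
obtain the general-radix FFT using a decomposition of $s^N - 1$. 


8.2.1 Matrix Notation 


We denote the $N \times N$ identity matrix with $I_N$, and diagonal matrices with 


$\text{diag}_{0 \le k < N}(\gamma_k) = \begin{bmatrix} \gamma_0 & & \\ & \ddots & \\ & & \gamma_{N-1} \end{bmatrix}.$ (8.23) 


The $N \times N$ stride permutation matrix is defined for $N = KM$ by the permutation 
$L^N_M : jK + i \mapsto iM + j$ (8.24) 
for $0 \le i < K$, $0 \le j < M$. This definition shows that $L^N_M$ transposes a $K \times M$ matrix 
stored in row-major order. Alternatively, we can write 
$L^N_M : i \mapsto iM \bmod (N - 1)$, for $0 \le i < N - 1$, $\quad N - 1 \mapsto N - 1.$ (8.25) 


For example ($\cdot$ means 0), 


$L^6_2 = \begin{bmatrix} 1 & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & 1 & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & 1 & \cdot \\ \cdot & 1 & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & 1 & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & 1 \end{bmatrix}.$ (8.26) 


$L^N_{N/2}$ is sometimes called the perfect shuffle. 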
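

A short sketch may help to fix the conventions. It assumes that the stride permutation acts on a vector $x$ by placing input entry $iM + j$ at output position $jK + i$ (for $0 \le i < K$, $0 \le j < M$), which is the reading under which $L^6_2$ has the matrix shown above and under which the permutation transposes a $K \times M$ array stored in row-major order. The function names are hypothetical.

```python
def stride_permutation(x, M):
    """Apply L^N_M to the list x, where N = len(x) = K * M."""
    N = len(x)
    K = N // M
    y = [None] * N
    for i in range(K):
        for j in range(M):
            y[j * K + i] = x[i * M + j]
    return y


def stride_permutation_mod(x, M):
    """Same permutation through the index map i -> i*M mod (N-1) of (8.25)."""
    N = len(x)
    return [x[(a * M) % (N - 1)] for a in range(N - 1)] + [x[N - 1]]


x = list(range(6))
print(stride_permutation(x, 2))   # L^6_2 applied to (0, 1, ..., 5)
print(stride_permutation(x, 3))   # L^6_3, the perfect shuffle for N = 6
```

Applying $L^N_M$ and then $L^N_K$ returns the original order, in line with the fact that permutation matrices are orthogonal.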
Further, we use matrix operators; namely the direct sum 
$A \oplus B = \begin{bmatrix} A & \\ & B \end{bmatrix}$ (8.27) 
and the Kronecker or tensor product 
$A \otimes B = [a_{k,\ell} B]_{k,\ell}$, for $A = [a_{k,\ell}]$. (8.28) 
In particular, 


$I_n \otimes A = A \oplus \cdots \oplus A = \begin{bmatrix} A & & \\ & \ddots & \\ & & A \end{bmatrix}$ (8.29) 


is block-diagonal. 


We may also construct a larger matrix as a matrix of matrices, e.g., 


$\begin{bmatrix} A & B \\ B & A \end{bmatrix}.$ (8.30) 


If an algorithm for a transform is given as a product of sparse matrices built from the 
constructs above, then an algorithm for the transpose or inverse of the transform can be 
readily derived using mathematical properties including 

$(AB)^T = B^T A^T, \quad (AB)^{-1} = B^{-1} A^{-1},$ 
$(A \oplus B)^T = A^T \oplus B^T, \quad (A \oplus B)^{-1} = A^{-1} \oplus B^{-1},$ (8.31) 
$(A \otimes B)^T = A^T \otimes B^T, \quad (A \otimes B)^{-1} = A^{-1} \otimes B^{-1}.$ 


Permutation matrices are orthogonal, i.e., $P^T = P^{-1}$. The transposition or inversion of 
diagonal matrices is obvious. 


8.2.2 Radix-2 FFT 


The DFT decomposes $\mathcal{A} = \mathbb{C}[s] / (s^N - 1)$ with basis $b = (1, s, \cdots, s^{N-1})$ as shown in 
(8.20). We assume $N = 2M$. Then 


$s^{2M} - 1 = (s^M - 1)(s^M + 1)$ (8.32) 


factors and we can apply the CRT in the following steps: 


$\mathbb{C}[s] / (s^N - 1) \to \mathbb{C}[s] / (s^M - 1) \oplus \mathbb{C}[s] / (s^M + 1)$ (8.33) 
$\to \bigoplus_{0 \le i < M} \mathbb{C}[s] / (s - W_N^{2i}) \oplus \bigoplus_{0 \le i < M} \mathbb{C}[s] / (s - W_N^{2i+1})$ (8.34) 
$\to \bigoplus_{0 \le i < N} \mathbb{C}[s] / (s - W_N^i).$ (8.35) 


As bases in the smaller algebras $\mathbb{C}[s] / (s^M - 1)$ and $\mathbb{C}[s] / (s^M + 1)$, we choose $c = d = 
(1, s, \cdots, s^{M-1})$. The derivation of an algorithm for $\text{DFT}_N$ based on (8.33)-(8.35) is now 
completely mechanical by reading off the matrix for each of the three decomposition steps. 
The product of these matrices is equal to the $\text{DFT}_N$. 


First, we derive the base change matrix $B$ corresponding to (8.33). To do so, we have to 
express the base elements $s^n \in b$ in the basis $c \cup d$; the coordinate vectors are the columns 
of $B$. For $0 \le n < M$, $s^n$ is actually contained in $c$ and $d$, so the first $M$ columns of $B$ are 


$B = \begin{bmatrix} I_M & * \\ I_M & * \end{bmatrix},$ (8.36) 


where the entries $*$ are determined next. For the base elements $s^{M+n}$, $0 \le n < M$, we have 


$s^{M+n} \equiv s^n \bmod (s^M - 1),$ 
$s^{M+n} \equiv -s^n \bmod (s^M + 1),$ (8.37) 
which yields the final result 
$B = \begin{bmatrix} I_M & I_M \\ I_M & -I_M \end{bmatrix} = \text{DFT}_2 \otimes I_M.$ (8.38) 


Next, we consider step (8.34). $\mathbb{C}[s] / (s^M - 1)$ is decomposed by $\text{DFT}_M$ and $\mathbb{C}[s] / (s^M + 1)$ 
by $\text{DFT-3}_M$ in (8.22). 


Finally, the permutation in step (8.35) is the perfect shuffle $L^N_M$, which interleaves the even 
and odd spectral components (even and odd exponents of $W_N$). 


The final algorithm obtained is 


$\text{DFT}_{2M} = L^N_M (\text{DFT}_M \oplus \text{DFT-3}_M)(\text{DFT}_2 \otimes I_M).$ (8.39) 


To obtain a better known form, we use $\text{DFT-3}_M = \text{DFT}_M D_M$, with $D_M = \text{diag}_{0 \le i < M}(W_N^i)$, 
which is evident from (8.22). It yields 


$\text{DFT}_{2M} = L^N_M (\text{DFT}_M \oplus \text{DFT}_M D_M)(\text{DFT}_2 \otimes I_M)$ 
$= L^N_M (I_2 \otimes \text{DFT}_M)(I_M \oplus D_M)(\text{DFT}_2 \otimes I_M).$ (8.40) 


The last expression is the radix-2 decimation-in-frequency Cooley-Tukey FFT. The cor- 
responding decimation-in-time version is obtained by transposition using (8.31) and the 
symmetry of the DFT: 

$\text{DFT}_{2M} = (\text{DFT}_2 \otimes I_M)(I_M \oplus D_M)(I_2 \otimes \text{DFT}_M) L^N_2.$ (8.41) 


The entries of the diagonal matrix $I_M \oplus D_M$ are commonly called twiddle factors. 


The above method for deriving DFT algorithms is used extensively in [268]. 
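

Read from right to left, the decimation-in-frequency factorization (8.40) translates into a short recursive routine. The sketch below is an illustration under stated assumptions rather than an optimized program: the function names are hypothetical, the length must be a power of two, and each comment names the matrix factor that the line implements.

```python
import cmath


def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT; len(x) is a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    M = N // 2
    # DFT_2 (x) I_M: butterflies between the two halves of the input
    top = [x[n] + x[n + M] for n in range(M)]
    bottom = [x[n] - x[n + M] for n in range(M)]
    # I_M (+) D_M: twiddle factors W_N^n applied to the second half only
    bottom = [b * cmath.exp(-2j * cmath.pi * n / N) for n, b in enumerate(bottom)]
    # I_2 (x) DFT_M: two transforms of half the size
    even, odd = fft_dif(top), fft_dif(bottom)
    # L^N_M: perfect shuffle, interleaving even and odd spectral components
    out = [0] * N
    out[0::2], out[1::2] = even, odd
    return out


def dft(x):
    """Direct evaluation of the DFT definition, used here only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


x = [1, 2, 3, 4, 0, -1, -2, -3]
print(max(abs(a - b) for a, b in zip(fft_dif(x), dft(x))))
```

The printed value is the largest deviation from the direct DFT and should be at the level of floating-point rounding.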


8.2.3 General-radix FFT 


To algebraically derive the general-radix FFT, we use the decomposition property of 
$s^N - 1$. Namely, if $N = KM$ then 


$s^N - 1 = (s^M)^K - 1.$ (8.42) 


Decomposition means that the polynomial is written as the composition of two polynomials: 
here, $s^M$ is inserted into $s^K - 1$. Note that this is a special property: most polynomials do 
not decompose. 


Based on this polynomial decomposition, we obtain the following stepwise decomposition 
of $\mathbb{C}[s] / (s^N - 1)$, which is more general than the previous one in (8.33)-(8.35). The basic 
idea is to first decompose with respect to the outer polynomial $t^K - 1$, $t = s^M$, and then 
completely [292]: 


$\mathbb{C}[s] / (s^N - 1) = \mathbb{C}[s] / ((s^M)^K - 1) \to \bigoplus_{0 \le i < K} \mathbb{C}[s] / (s^M - W_K^i)$ (8.43) 
$\to \bigoplus_{0 \le i < K} \bigoplus_{0 \le j < M} \mathbb{C}[s] / (s - W_N^{jK + i})$ (8.44) 
$\to \bigoplus_{0 \le i < N} \mathbb{C}[s] / (s - W_N^i).$ (8.45) 


As bases in the smaller algebras $\mathbb{C}[s] / (s^M - W_K^i)$ we choose $c_i = (1, s, \cdots, s^{M-1})$. As 
before, the derivation is completely mechanical from here: only the three matrices corre- 
sponding to (8.43)-(8.45) have to be read off. 


The first decomposition step requires us to compute $s^n \bmod (s^M - W_K^i)$, $0 \le n < N$. To 
do so, we decompose the index $n$ as $n = \ell M + m$ and compute 


$s^n = s^{\ell M + m} = (s^M)^{\ell} s^m \equiv W_K^{\ell i} s^m \bmod (s^M - W_K^i).$ (8.46) 


This shows that the matrix for (8.43) is given by $\text{DFT}_K \otimes I_M$. 
In step (8.44), each $\mathbb{C}[s] / (s^M - W_K^i)$ is completely decomposed by its polynomial transform 


$\text{DFT}_M(i, K) = \text{DFT}_M \cdot \text{diag}_{0 \le j < M}(W_N^{ij}).$ (8.47) 


At this point, $\mathbb{C}[s] / (s^N - 1)$ is completely decomposed, but the spectrum is ordered ac- 
cording to $jK + i$, $0 \le i < K$, $0 \le j < M$ ($j$ runs faster). The desired order is $iM + j$. 


Thus, in step (8.45), we need to apply the permutation $jK + i \mapsto iM + j$, which is exactly 
the stride permutation $L^N_M$ in (8.24). 


In summary, we obtain the Cooley-Tukey decimation-in-frequency FFT with arbitrary radix: 


$L^N_M \left( \bigoplus_{0 \le i < K} \text{DFT}_M \cdot \text{diag}_{j=0}^{M-1}(W_N^{ij}) \right) (\text{DFT}_K \otimes I_M)$ 
$= L^N_M (I_K \otimes \text{DFT}_M) \, T^N_M \, (\text{DFT}_K \otimes I_M).$ (8.48) 


The matrix $T^N_M$ is diagonal and usually called the twiddle matrix. Transposition using 
(8.31) yields the corresponding decimation-in-time version: 


$(\text{DFT}_K \otimes I_M) \, T^N_M \, (I_K \otimes \text{DFT}_M) \, L^N_K.$ (8.49) 
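

One level of the general-radix decimation-in-frequency factorization can be sketched by applying the four factors from right to left. This is only an illustration with hypothetical function names; the smaller DFTs are computed directly from the definition, and the final loop assumes the stride permutation places entry $iM + j$ at position $jK + i$.

```python
import cmath


def w(N, e):
    """W_N^e = exp(-2*pi*j*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)


def dft(x):
    """Direct DFT, used for the small transforms and as a reference."""
    N = len(x)
    return [sum(x[n] * w(N, k * n) for n in range(N)) for k in range(N)]


def cooley_tukey_dif(x, K):
    """One decomposition step for N = K * M, following the factors of (8.48)."""
    N = len(x)
    M = N // K
    # DFT_K (x) I_M: length-K DFTs across the K blocks of length M
    y = [0] * N
    for m in range(M):
        column = dft([x[l * M + m] for l in range(K)])
        for i in range(K):
            y[i * M + m] = column[i]
    # T^N_M: twiddle factor W_N^(i*m) at position i*M + m
    y = [y[i * M + m] * w(N, i * m) for i in range(K) for m in range(M)]
    # I_K (x) DFT_M: a length-M DFT on each block
    z = []
    for i in range(K):
        z.extend(dft(y[i * M:(i + 1) * M]))
    # L^N_M: entry i*M + j holds spectral component j*K + i
    out = [0] * N
    for i in range(K):
        for j in range(M):
            out[j * K + i] = z[i * M + j]
    return out


x = [float(n * n % 7) for n in range(12)]
print(max(abs(a - b) for a, b in zip(cooley_tukey_dif(x, 3), dft(x))))
```

Replacing the direct calls to `dft` by recursive calls would turn this single step into a complete mixed-radix FFT.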


8.3 Discussion and Further Reading 


This chapter only scratches the surface of the connection between algebra and the DFT or 
signal processing in general. We provide a few references for further reading. 


8.3.1 Algebraic Derivation of Transform Algorithms 


As mentioned before, the use of polynomial algebras and the CRT underlies much of the early 
work on FFTs and convolution algorithms [409], [268], [12]. For example, Winograd’s work on 
FFTs minimizes the number of non-rational multiplications. This and his work on complexity 
theory in general makes heavy use of polynomial algebras [409], [414], [417] (see 
Winograd's Short DFT Algorithms (Chapter 7) for more information and references). See 
[72] for a broad treatment of algebraic complexity theory. 


Since $\mathbb{C}[s] / (s^N - 1) = \mathbb{C}[C_N]$ can be viewed as a group algebra for the cyclic group, the 
methods shown in this chapter can be translated into the context of group representation 
theory. For example, [256] derives the general-radix FFT using group theory and already 
uses the Kronecker product formalism. So does Beth, who started the area of FFTs 
for more general groups [23], [231]. However, Fourier transforms for groups have found only 
sporadic applications [317]. Along a related line of work, [117] shows that using group theory 
it is possible to discover and generate certain algorithms for trigonometric transforms, 
such as discrete cosine transforms (DCTs), automatically using a computer program. 
More recently, the polynomial algebra framework was extended to include most trigonometric 
transforms used in signal processing [293], [295], namely, besides the DFT, the discrete cosine 
and sine transforms and various real DFTs including the discrete Hartley transform. It turns 
out that the same techniques shown in this chapter can then be applied to derive, explain, 
and classify most of the known algorithms for these transforms and even obtain a large 
class of new algorithms including general-radix algorithms for the discrete cosine and sine 
transforms (DCTs/DSTs) [292], [294], [398], [397]. 


This latter line of work is part of the algebraic signal processing theory briefly discussed 
next. 


8.3.2 Algebraic Signal Processing Theory 


The algebraic properties of transforms used in the above work on algorithm derivation hint 
at a connection between algebra and (linear) signal processing itself. This is indeed the case 
and was fully developed in a recent body of work called algebraic signal processing theory 
(ASP). The foundation of ASP is developed in [295], [293], [291]. 


ASP first identifies the algebraic structure of (linear) signal processing: the common as- 
sumptions on available operations for filters and signals make the set of filters an algebra $\mathcal{A}$ 
and the set of signals an associated $\mathcal{A}$-module $\mathcal{M}$. ASP then builds a signal processing 
theory formally from the axiomatic definition of a signal model: a triple $(\mathcal{A}, \mathcal{M}, \Phi)$, where 
$\Phi$ generalizes the idea of the z-transform to mappings from vector spaces of signal values 
to $\mathcal{M}$. If a signal model is given, other concepts, such as spectrum, Fourier transform, and fre- 
quency response are automatically defined but take different forms for different models. For 
example, infinite and finite time as discussed in Table 8.1 are two examples of signal models. 
Their complete definition is provided in Table 8.2 and identifies the proper notion of a finite 
z-transform as a mapping $\mathbb{C}^N \to \mathbb{C}[s] / (s^N - 1)$. 

















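Signal model | Infinite time | Finite time 
$\mathcal{A}$ | $\{\sum_{n \in \mathbb{Z}} H(n) s^n \mid (\cdots, H(-1), H(0), H(1), \cdots) \in \ell^1(\mathbb{Z})\}$ | $\mathbb{C}[s] / (s^N - 1)$ 
$\mathcal{M}$ | $\{\sum_{n \in \mathbb{Z}} X(n) s^n \mid (\cdots, X(-1), X(0), X(1), \cdots) \in \ell^2(\mathbb{Z})\}$ | $\mathbb{C}[s] / (s^N - 1)$ 
$\Phi$ | $\Phi : \ell^2(\mathbb{Z}) \to \mathcal{M}$, defined in (8.1) | $\Phi : \mathbb{C}^N \to \mathcal{M}$, defined in (8.4) 


Table 8.2: Infinite and finite time models as defined in ASP. 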


ASP shows that many signal models are in principle possible, each with its own notion of 
filtering and Fourier transform. Those that support shift-invariance have commutative alge- 
bras. Since finite-dimensional commutative algebras are precisely polynomial algebras, their 
appearance in signal processing is explained. For example, ASP identifies the polynomial 
algebras underlying the DCTs and DSTs, which hence become Fourier transforms in the ASP 
sense. The signal models are called finite space models since they support signal processing 
based on an undirected shift operator, different from the directed time shift. Many more 
insights are provided by ASP including the need for and choices in choosing boundary con- 
ditions, properties of transforms, techniques for deriving new signal models, and the concise 
derivation of algorithms mentioned before. 


Chapter 9 


The Cooley-Tukey Fast Fourier 
Transform Algorithm¹ 


The publication by Cooley and Tukey [90] in 1965 of an efficient algorithm for the calculation 
of the DFT was a major turning point in the development of digital signal processing. During 
the five or so years that followed, various extensions and modifications were made to the 
original algorithm [95]. By the early 1970’s the practical programs were basically in the 
form used today. The standard development presented in [274], [299], [38] shows how the 
DFT of a length-N sequence can be simply calculated from the two length-N/2 DFT’s of 
the even index terms and the odd index terms. This is then applied to the two half-length 
DFT’s to give four quarter-length DFT’s, and repeated until N scalars are left which are the 
DFT values. Because of alternately taking the even and odd index terms, two forms of the 
resulting programs are called decimation-in-time and decimation-in-frequency. For a length 
of $2^M$, the dividing process is repeated $M = \log_2 N$ times and requires $N$ multiplications 
each time. This gives the famous formula for the computational complexity of the FFT of 
$N \log_2 N$ which was derived in Multidimensional Index Mapping: Equation 34 (3.34). 


Although the decimation methods are straightforward and easy to understand, they do 
not generalize well. For that reason it will be assumed that the reader is familiar with that 
description and this chapter will develop the FFT using the index map from Multidimensional 
Index Mapping (Chapter 3). 


The Cooley-Tukey FFT always uses the Type 2 index map from Multidimensional Index 
Mapping: Equation 11 (3.11). This is necessary for the most popular forms that have 
$N = R^M$, but is also used even when the factors are relatively prime and a Type 1 map could 
be used. The time and frequency maps from Multidimensional Index Mapping: Equation 6 
(3.6) and Multidimensional Index Mapping: Equation 12 (3.12) are 


$n = ((K_1 n_1 + K_2 n_2))_N$ (9.1) 


$k = ((K_3 k_1 + K_4 k_2))_N$ (9.2) 


¹This content is available online at <http://cnx.org/content/m16334/1.13/>. 

Type-2 conditions Multidimensional Index Mapping: Equation 8 (3.8) and Multidimensional 
Index Mapping: Equation 11 (3.11) become 

$K_1 = aN_2$ or $K_2 = bN_1$ but not both (9.3) 


and 


$K_3 = cN_2$ or $K_4 = dN_1$ but not both (9.4) 
The row and column calculations in Multidimensional Index Mapping: Equation 15 (3.15) 
are uncoupled by Multidimensional Index Mapping: Equation 16 (3.16) which for this case 
are 


$((K_1 K_4))_N = 0$ or $((K_2 K_3))_N = 0$ but not both (9.5) 
To make each short sum a DFT, the $K_i$ must satisfy 


$((K_1 K_3))_N = N_2$ and $((K_2 K_4))_N = N_1$ (9.6) 


In order to have the smallest values for $K_i$ the constants in (9.3) are chosen to be 


$a = d = K_2 = K_3 = 1$ (9.7) 


which makes the index maps of (9.1) become 


$n = N_2 n_1 + n_2$ (9.8) 


$k = k_1 + N_1 k_2$ (9.9) 


These index maps are all evaluated modulo $N$, but in (9.8), explicit reduction is not
necessary since $n$ never exceeds $N - 1$. The reduction notation will be omitted for clarity.
From Multidimensional Index Mapping: Equation 15 (3.15) and example Multidimensional Index
Mapping: Equation 19 (3.19), the DFT is

$$X = \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1} x \, W_{N_1}^{n_1 k_1} \, W_N^{n_2 k_1} \, W_{N_2}^{n_2 k_2} \tag{9.10}$$


This map of (9.8) and the form of the DFT in (9.10) are the fundamentals of the Cooley- 
Tukey FFT. 
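
The structure of (9.10) can be checked numerically. The following Python sketch is an
illustration only, not an efficient program: it evaluates (9.10) literally with the maps
(9.8) and (9.9) and can be compared with a direct evaluation of the DFT definition. The
function names `w`, `dft`, and `cooley_tukey_one_stage` are chosen here for illustration
and do not come from the references.

```python
import cmath


def w(length, exponent):
    """W_length^exponent with W_length = exp(-j*2*pi/length)."""
    return cmath.exp(-2j * cmath.pi * exponent / length)


def dft(x):
    """Direct evaluation of the DFT definition, used only as a reference."""
    n_len = len(x)
    return [sum(x[n] * w(n_len, n * k) for n in range(n_len))
            for k in range(n_len)]


def cooley_tukey_one_stage(x, n1_len, n2_len):
    """Evaluate (9.10) using n = N2*n1 + n2 and k = k1 + N1*k2."""
    n_len = n1_len * n2_len
    if len(x) != n_len:
        raise ValueError("len(x) must equal N1*N2")
    out = [0j] * n_len
    for k1 in range(n1_len):
        for k2 in range(n2_len):
            acc = 0j
            for n2 in range(n2_len):
                # Inner sum over n1: one output of a length-N1 DFT.
                short = sum(x[n2_len * n1 + n2] * w(n1_len, n1 * k1)
                            for n1 in range(n1_len))
                # Twiddle factor W_N^(n2*k1), then the length-N2 DFT kernel.
                acc += short * w(n_len, n2 * k1) * w(n2_len, n2 * k2)
            out[k1 + n1_len * k2] = acc
    return out
```

Calling `cooley_tukey_one_stage(x, 2, 4)` on a length-8 sequence should agree with `dft(x)`
to within round-off. The sketch recomputes the short DFTs for every output, so it shows the
index structure but none of the arithmetic savings.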


The order of the summations using the Type-2 map in (9.10) cannot be reversed as it can
with the Type-1 map. This is because of the $W_N$ terms, the twiddle factors.

Turning (9.10) into an efficient program requires some care. From Multidimensional Index
Mapping: Efficiencies Resulting from Index Mapping with the DFT (Section 3.3: Efficiencies
Resulting from Index Mapping with the DFT) we know that all the factors should be equal.
If $N = R^M$, with $R$ called the radix, $N_1$ is first set equal to $R$ and $N_2$ is then
necessarily $R^{M-1}$. Consider $n_1$ to be the index along the rows and $n_2$ along the
columns. The inner sum of (9.10) over $n_1$ represents a length-$N_1$ DFT for each value of
$n_2$. These $N_2$ length-$N_1$ DFT’s are the DFT’s of the rows of the $x(n_1, n_2)$ array.
The resulting array of row DFT’s is multiplied by an array of twiddle factors which are the
$W_N$ terms in (9.10). The twiddle-factor array for a length-8 radix-2 FFT is


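$$TF: \ W_8^{n_2 k_1} = \begin{bmatrix} W^0 & W^0 \\ W^0 & W^1 \\ W^0 & W^2 \\ W^0 & W^3 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & W \\ 1 & -j \\ 1 & -jW \end{bmatrix} \tag{9.11}$$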


The twiddle factor array will always have unity in the first row and first column. 
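
As a small check on (9.11), the twiddle-factor array can be generated directly from its
definition $W_N^{n_2 k_1}$. The sketch below is illustrative; the name `twiddle_array` is an
assumption made here, with rows indexed by $n_2$ and columns by $k_1$.

```python
import cmath


def twiddle_array(n1_len, n2_len):
    """Rows are indexed by n2, columns by k1; each entry is W_N^(n2*k1)."""
    n_len = n1_len * n2_len
    return [[cmath.exp(-2j * cmath.pi * n2 * k1 / n_len)
             for k1 in range(n1_len)]
            for n2 in range(n2_len)]


tf = twiddle_array(2, 4)  # first stage of a length-8 radix-2 FFT
for row in tf:
    print(["{:.3f}".format(z) for z in row])
```

The first row and first column come out as unity, and the last two entries of the second
column are $-j$ and $-jW$ up to round-off.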


To complete (9.10) at this point, after the row DFT’s are multiplied by the TF array, the
$N_1$ length-$N_2$ DFT’s of the columns are calculated. However, since the column DFT’s are
of length $R^{M-1}$, they can be posed as an $R^{M-2}$ by $R$ array and the process repeated,
again using length-$R$ DFT’s. After $M$ stages of length-$R$ DFT’s with TF multiplications
interleaved, the DFT is complete. The flow graph of a length-2 DFT is given in Figure 9.1 and
is called a butterfly because of its shape. The flow graph of the complete length-8 radix-2
FFT is shown in Figure 9.2.





[Figure: radix-2 butterfly with inputs x(0) and x(1) and outputs X(0) = x(0) + x(1) and
X(1) = x(0) - x(1)]

Figure 9.1: A Radix-2 Butterfly

Figure 9.2: Length-8 Radix-2 FFT Flow Graph





This flow-graph, the twiddle factor map of (9.11), and the basic equation (9.10) should be 
completely understood before going further. 


A very efficient indexing scheme has evolved over the years that results in a compact and
efficient computer program. A FORTRAN program is given below that implements the
radix-2 FFT. It should be studied [64] to see how it implements (9.10) and the flow-graph
representation.

C     In-place radix-2 decimation-in-frequency FFT.
C     X, Y hold the real and imaginary parts of the data, N = 2**M.
      N2 = N
      DO 10 K = 1, M
          N1 = N2
          N2 = N2/2
          E  = 6.28318/N1
          A  = 0
          DO 20 J = 1, N2
C             Twiddle factor C + jS for this butterfly position
              C =  COS (A)
              S = -SIN (A)
              A =  J*E
              DO 30 I = J, N, N1
                  L = I + N2
                  XT   = X(I) - X(L)
                  X(I) = X(I) + X(L)
                  YT   = Y(I) - Y(L)
                  Y(I) = Y(I) + Y(L)
                  X(L) = XT*C - YT*S
                  Y(L) = XT*S + YT*C
   30         CONTINUE
   20     CONTINUE
   10 CONTINUE


Listing 9.1: A Radix-2 Cooley-Tukey FFT Program 
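
For readers who want to experiment, the following is a hedged Python transcription of the
radix-2 program, using 0-based indexing and separate lists for the real and imaginary parts.
Like the FORTRAN version it works in place and leaves the output in bit-reversed order, so a
simple unscrambler is included. The names `fft_radix2_dif` and `bit_reverse_permute` are
assumptions made for this sketch, and the length is assumed to be a power of two.

```python
import math


def fft_radix2_dif(x, y):
    """In-place radix-2 decimation-in-frequency FFT; output is bit-reversed."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    n2 = n
    while n2 > 1:
        n1 = n2
        n2 //= 2
        e = 2.0 * math.pi / n1
        for j in range(n2):
            c = math.cos(j * e)
            s = -math.sin(j * e)
            for i in range(j, n, n1):
                l = i + n2
                xt = x[i] - x[l]
                x[i] += x[l]
                yt = y[i] - y[l]
                y[i] += y[l]
                # Multiply the difference by the twiddle factor c + j*s.
                x[l] = xt * c - yt * s
                y[l] = xt * s + yt * c


def bit_reverse_permute(x, y):
    """Unscramble bit-reversed data in place."""
    n = len(x)
    bits = n.bit_length() - 1
    for i in range(n):
        r, t = 0, i
        for _ in range(bits):
            r = (r << 1) | (t & 1)
            t >>= 1
        if r > i:  # swap each pair only once
            x[i], x[r] = x[r], x[i]
            y[i], y[r] = y[r], y[i]
```

Calling `fft_radix2_dif(x, y)` followed by `bit_reverse_permute(x, y)` should leave the DFT
in natural order in `x` and `y`.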





This discussion, the flow graph of Figure 9.2, and the program of Listing 9.1 are all based
on the input index map of Multidimensional Index Mapping: Equation 6 (3.6) and (9.1), and the
calculations are performed in-place. According to Multidimensional Index Mapping: In-Place
Calculation of the DFT and Scrambling (Section 3.2: In-Place Calculation of the DFT and
Scrambling), this means the output is scrambled in bit-reversed order and should be followed
by an unscrambler to give the DFT in proper order. This formulation is called a
decimation-in-frequency FFT [274], [299], [38]. A very similar algorithm based on the output
index map can be derived which is called a decimation-in-time FFT. Examples of FFT programs
are found in [64] and in the Appendix of this book.

Soon after the paper by Cooley and Tukey, there were improvements and extensions made.
One very important discovery was the improvement in efficiency by using a larger radix of 
4, 8 or even 16. For example, just as for the radix-2 butterfly, there are no multiplications 
required for a length-4 DFT, and therefore, a radix-4 FFT would have only twiddle factor 
multiplications. Because there are half as many stages in a radix-4 FFT, there would be half 
as many multiplications as in a radix-2 FFT. In practice, because some of the multiplications 
are by unity, the improvement is not by a factor of two, but it is significant. A radix-4 FFT 
is easily developed from the basic radix-2 structure by replacing the length-2 butterfly by 
a length-4 butterfly and making a few other modifications. Programs can be found in [64]
and operation counts will be given in "Evaluation of the Cooley-Tukey FFT Algorithms" 
(Section 9.3: Evaluation of the Cooley-Tukey FFT Algorithms). 


Increasing the radix to 8 gives some improvement but not as much as from 2 to 4. Increasing 
it to 16 is theoretically promising but the small decrease in multiplications is somewhat 
offset by an increase in additions and the program becomes rather long. Other radices are 
not attractive because they generally require a substantial number of multiplications and 
additions in the butterflies. 


The second method of reducing arithmetic is to remove the unnecessary TF multiplications
by plus or minus unity or by plus or minus the square root of minus one. This occurs
when the exponent of $W_N$ is zero or a multiple of $N/4$. A reduction of additions as well
as multiplications is achieved by removing these extraneous complex multiplications since
a complex multiplication requires at least two real additions. In a program, this reduction
is usually achieved by having special butterflies for the cases where the TF is one or $j$. As
many as four special butterflies may be necessary to remove all unnecessary arithmetic, but
in many cases there will be no practical improvement above two or three.


In addition to removing multiplications by one or $j$, there can be a reduction in
multiplications by using a special butterfly for TFs with $W_N^{N/8}$, which have equal real
and imaginary parts. Also, for computers or hardware with multiplication considerably slower
than addition, it is desirable to use an algorithm for complex multiplication that requires
three multiplications and three additions rather than the conventional four multiplications
and two additions. Note that this gives no reduction in the total number of arithmetic
operations, but does give a trade of multiplications for additions. This is one reason not to
use complex data types in programs but to explicitly program complex arithmetic.
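
One common way to obtain the three-multiply form is sketched below in Python. It is an
illustration under stated assumptions, not necessarily the arrangement used in the cited
programs: the twiddle factor $c + js$ is assumed known in advance so that $s - c$ and $c + s$
can be precomputed and stored, which is what brings the run-time cost down to three
multiplications and three additions. All function names are hypothetical.

```python
def complex_mul_4m2a(a, b, c, s):
    """(a + jb)(c + js) with four real multiplications and two additions."""
    return a * c - b * s, a * s + b * c


def precompute_twiddle(c, s):
    """Stored once per twiddle factor: c, s - c, and c + s."""
    return c, s - c, c + s


def complex_mul_3m3a(a, b, tw):
    """(a + jb)(c + js) with three real multiplications and three additions."""
    c, s_minus_c, c_plus_s = tw
    k1 = c * (a + b)          # one addition, one multiplication
    k2 = a * s_minus_c        # one multiplication
    k3 = b * c_plus_s         # one multiplication
    return k1 - k3, k1 + k2   # two additions
```

Expanding the products shows that `k1 - k3` equals $ac - bs$ and `k1 + k2` equals $as + bc$,
so both routines return the same real and imaginary parts.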


A time-consuming and unnecessary part of the execution of a FFT program is the calculation 
of the sine and cosine terms which are the real and imaginary parts of the TFs. There are 
basically three approaches to obtaining the sine and cosine values. They can be calculated 
as needed which is what is done in the sample program above. One value per stage can be 
calculated and the others recursively calculated from those. That method is fast but suffers 
from accumulated round-off errors. The fastest method is to fetch precalculated values from
a stored table. This has the disadvantage of requiring considerable memory space.
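
The three approaches can be compared with a short Python sketch. It is illustrative only:
`twiddles_direct` evaluates the cosine and sine for every factor, `twiddles_recursive`
evaluates them once and obtains the rest by repeated rotation, and `lookup` fetches a factor
for a shorter stage from a table built once for the full length. These names and the table
layout are assumptions made here.

```python
import math


def twiddles_direct(n):
    """cos/sin evaluated for every k = 0 .. N/2 - 1."""
    return [(math.cos(2 * math.pi * k / n), -math.sin(2 * math.pi * k / n))
            for k in range(n // 2)]


def twiddles_recursive(n):
    """One cos/sin evaluation; the remaining factors by repeated rotation."""
    c1 = math.cos(2 * math.pi / n)
    s1 = -math.sin(2 * math.pi / n)
    out = [(1.0, 0.0)]
    for _ in range(1, n // 2):
        c, s = out[-1]
        out.append((c * c1 - s * s1, c * s1 + s * c1))
    return out


def lookup(table, n, n1, j):
    """Fetch W_{n1}^j from a table built for length n (n1 must divide n)."""
    return table[j * (n // n1)]


n = 1024
table = twiddles_direct(n)
rec = twiddles_recursive(n)
worst = max(abs(a - c) + abs(b - d) for (a, b), (c, d) in zip(table, rec))
print("largest error of the recursive values:", worst)
```

The printed error is small in double precision but grows with the number of recursion
steps, which is the accumulated round-off mentioned above; the table costs $N/2$ stored
pairs instead.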


If all the N DFT values are not needed, special forms of the FFT can be developed using 
a process called pruning [226] which removes the operations concerned with the unneeded 
outputs. 


Special algorithms are possible for cases with real data or with symmetric data [82]. The 
decimation-in-time algorithm can be easily modified to transform real data and save half the 
arithmetic required for complex data [357]. There are numerous other modifications to deal 
with special hardware considerations such as an array processor or a special microprocessor 
such as the Texas Instruments TMS320. Examples of programs that deal with some of these 
items can be found in [299], [64], [82]. 


9.2 The Split-Radix FFT Algorithm 


Recently several papers [228], [106], [393], [350], [102] have been published on algorithms to
calculate a length-$2^M$ DFT more efficiently than a Cooley-Tukey FFT of any radix. They
all have the same computational complexity and are optimal for lengths up through 16, and
until recently were thought to give the best total add-multiply count possible for any
power-of-two length. Yavne published an algorithm with the same computational complexity in
1968 [421], but it went largely unnoticed. Johnson and Frigo have recently reported the
first improvement in almost 40 years [201]. The reduction in total operations is only a few
percent, but it is a reduction.


The basic idea behind the split-radix FFT (SRFFT) as derived by Duhamel and Hollmann
[106], [102] is the application of a radix-2 index map to the even-indexed terms and a radix-4
map to the odd-indexed terms. The basic definition of the DFT

$$C_k = \sum_{n=0}^{N-1} x_n W^{nk} \tag{9.12}$$

with $W = e^{-j2\pi/N}$ gives

$$C_{2k} = \sum_{n=0}^{N/2-1} \left[ x_n + x_{n+N/2} \right] W^{2nk} \tag{9.13}$$

for the even index terms, and

$$C_{4k+1} = \sum_{n=0}^{N/4-1} \left[ (x_n - x_{n+N/2}) - j (x_{n+N/4} - x_{n+3N/4}) \right] W^{n} W^{4nk} \tag{9.14}$$

and

$$C_{4k+3} = \sum_{n=0}^{N/4-1} \left[ (x_n - x_{n+N/2}) + j (x_{n+N/4} - x_{n+3N/4}) \right] W^{3n} W^{4nk} \tag{9.15}$$


for the odd index terms. This results in an L-shaped “butterfly” shown in Figure 9.3
which relates a length-N DFT to one length-N/2 DFT and two length-N/4 DFT’s with
twiddle factors. Repeating this process for the half and quarter length DFT’s until scalars
result gives the SRFFT algorithm in much the same way the decimation-in-frequency radix-2
Cooley-Tukey FFT is derived [274], [299], [38]. The resulting flow graph for the algorithm
calculated in place looks like a radix-2 FFT except for the location of the twiddle factors.
Indeed, it is the location of the twiddle factors that makes this algorithm use less arithmetic.
The L-shaped SRFFT butterfly (Figure 9.3) advances the calculation of the top half by one
of the M stages while the lower half, like a radix-4 butterfly, calculates two stages at once.
This is illustrated for N = 8 in Figure 9.4.
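
The decomposition in (9.13), (9.14), and (9.15) translates almost line for line into a
recursive routine. The Python sketch below is meant only to make the structure concrete: it
is not in place, it returns the output in natural order, and it assumes the length is a power
of two. The name `split_radix_fft` is chosen here for illustration.

```python
import cmath


def split_radix_fft(x):
    """Recursive split-radix DFT of a power-of-two-length sequence."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    q = n // 4
    # (9.13): even-indexed outputs from one length-N/2 DFT.
    even = split_radix_fft([x[i] + x[i + 2 * q] for i in range(2 * q)])
    # (9.14) and (9.15): odd-indexed outputs from two length-N/4 DFTs.
    seq1, seq3 = [], []
    for i in range(q):
        d0 = x[i] - x[i + 2 * q]
        d1 = x[i + q] - x[i + 3 * q]
        w1 = cmath.exp(-2j * cmath.pi * i / n)      # W^n
        w3 = cmath.exp(-2j * cmath.pi * 3 * i / n)  # W^(3n)
        seq1.append((d0 - 1j * d1) * w1)
        seq3.append((d0 + 1j * d1) * w3)
    out = [0j] * n
    out[0::2] = even
    out[1::4] = split_radix_fft(seq1)   # C_{4k+1}
    out[3::4] = split_radix_fft(seq3)   # C_{4k+3}
    return out
```

The multiplications by `1j` are written out for clarity; in a real program they are only
exchanges of real and imaginary parts with sign changes and cost no multiplications.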





Figure 9.3: SRFFT Butterfly

Figure 9.4: Length-8 SRFFT





Unlike the fixed radix, mixed radix or variable radix Cooley-Tukey FFT or even the prime 
factor algorithm or Winograd Fourier transform algorithm, the Split-Radix FFT does not
progress completely stage by stage, or, in terms of indices, does not complete each nested 
sum in order. This is perhaps better seen from the polynomial formulation of Martens 
[228]. Because of this, the indexing is somewhat more complicated than the conventional 
Cooley-Tukey program. 


A FORTRAN program is given below which implements the basic decimation-in-frequency
split-radix FFT algorithm. The indexing scheme [350] of this program gives a structure
very similar to the Cooley-Tukey programs in [64] and allows the same modifications and
improvements such as decimation-in-time, multiple butterflies, table look-up of sine and
cosine values, three real per complex multiply methods, and real data versions [102], [357].

      SUBROUTINE FFT(X,Y,N,M)
      REAL X(N), Y(N)
      N2 = 2*N
      DO 10 K = 1, M-1
          N2 = N2/2
          N4 = N2/4
          E  = 6.283185307179586/N2
          A  = 0
          DO 20 J = 1, N4
              A3  = 3*A
              CC1 = COS(A)
              SS1 = SIN(A)
              CC3 = COS(A3)
              SS3 = SIN(A3)
              A   = J*E
              IS  = J
              ID  = 2*N2
C             L-shaped butterflies for this twiddle-factor position
   40         DO 30 I0 = IS, N-1, ID
                  I1 = I0 + N4
                  I2 = I1 + N4
                  I3 = I2 + N4
                  R1    = X(I0) - X(I2)
                  X(I0) = X(I0) + X(I2)
                  R2    = X(I1) - X(I3)
                  X(I1) = X(I1) + X(I3)
                  S1    = Y(I0) - Y(I2)
                  Y(I0) = Y(I0) + Y(I2)
                  S2    = Y(I1) - Y(I3)
                  Y(I1) = Y(I1) + Y(I3)
                  S3    = R1 - S2
                  R1    = R1 + S2
                  S2    = R2 - S1
                  R2    = R2 + S1
                  X(I2) = R1*CC1 - S2*SS1
                  Y(I2) =-S2*CC1 - R1*SS1
                  X(I3) = S3*CC3 + R2*SS3
                  Y(I3) = R2*CC3 - S3*SS3
   30         CONTINUE
              IS = 2*ID - N2 + J
              ID = 4*ID
              IF (IS.LT.N) GOTO 40
   20     CONTINUE
   10 CONTINUE
C     Last stage: length-2 butterflies
      IS = 1
      ID = 4
   50 DO 60 I0 = IS, N, ID
          I1    = I0 + 1
          R1    = X(I0)
          X(I0) = R1 + X(I1)
          X(I1) = R1 - X(I1)
          R1    = Y(I0)
          Y(I0) = R1 + Y(I1)
          Y(I1) = R1 - Y(I1)
   60 CONTINUE
      IS = 2*ID - 1
      ID = 4*ID
      IF (IS.LT.N) GOTO 50
      RETURN
      END

As was done for the other decimation-in-frequency algorithms, the input index map is used
and the calculations are done in place resulting in the output being in bit-reversed order. 
It is the three statements following label 30 that do the special indexing required by the 
SRFFT. The last stage is length-2 and, therefore, inappropriate for the standard L-shaped
butterfly, so it is calculated separately in the DO 60 loop. This program is considered a
one-butterfly version. A second butterfly can be added just before statement 40 to remove the
unnecessary multiplications by unity. A third butterfly can be added to reduce the number
of real multiplications from four to two for the complex multiplication when W has equal real
and imaginary parts. It is also possible to reduce the arithmetic for the two-butterfly case
and to reduce the data transfers by directly programming a length-4 and length-8 butterfly 
to replace the last three stages. This is called a two-butterfly-plus version. Operation counts 
for the one, two, two-plus and three butterfly SRFFT programs are given in the next section. 
Some details can be found in [350]. 


The special case of a SRFFT for real data and symmetric data is discussed in [102]. An 
application of the decimation-in-time SRFFT to real data is given in [357]. Application to 
convolution is made in [110], to the discrete Hartley transform in [352], [110], to calculating 
the discrete cosine transform in [393], and could be made to calculating number theoretic 
transforms. 


An improvement in operation count has been reported by Johnson and Frigo [201] which 
involves a scaling of multiplying factors. The improvement is small but until this result, it 
was generally thought the Split-Radix FFT was optimal for total floating point operation 
count. 


9.3 Evaluation of the Cooley-Tukey FFT Algorithms 


The evaluation of any FFT algorithm starts with a count of the real (or floating point) 
arithmetic. Table 9.1 gives the number of real multiplications and additions required to 
calculate a length-N FFT of complex data. Results of programs with one, two, three and 
five butterflies are given to show the improvement that can be expected from removing 
unnecessary multiplications and additions. Results of radices two, four, eight and sixteen 
for the Cooley-Tukey FFT as well as of the split-radix FFT are given to show the relative 
merits of the various structures. Comparisons of these data should be made with the table 
of counts for the PFA and WFTA programs in The Prime Factor and Winograd Fourier 
Transform Algorithms: Evaluation of the PFA and WFTA (Section 10.4: Evaluation of the 
PFA and WFTA). All programs use the four-multiply-two-add complex multiply algorithm. 
A similar table can be developed for the three-multiply-three-add algorithm, but the relative 
results are the same. 


From the table it is seen that a greater improvement is obtained going from radix-2 to 4 than 
from 4 to 8 or 16. This is partly because length 2 and 4 butterflies have no multiplications 
while length 8, 16 and higher do. It is also seen that going from one to two butterflies gives
more improvement than going from two to higher values. From an operation count point of
view and from practical experience, a three butterfly radix-4 or a two butterfly radix-8 FFT 
is a good compromise. The radix-8 and 16 programs become long, especially with multiple 
butterflies, and they give a limited choice of transform length unless combined with some 
length 2 and 4 butterflies. 














   N |    M1 |    M2 |    M3 |    M5 |     A1 |     A2 |     A3 |     A5
-----+-------+-------+-------+-------+--------+--------+--------+-------
Radix-2
   2 |     4 |     0 |     0 |     0 |      6 |      4 |      4 |      4
   4 |    16 |     4 |     0 |     0 |     24 |     18 |     16 |     16
   8 |    48 |    20 |     8 |     4 |     72 |     58 |     52 |     52
  16 |   128 |    68 |    40 |    28 |    192 |    162 |    148 |    148
  32 |   320 |   196 |   136 |   108 |    480 |    418 |    388 |    388
  64 |   768 |   516 |   392 |   332 |   1152 |   1026 |    964 |    964
 128 |  1792 |  1284 |  1032 |   908 |   2688 |   2434 |   2308 |   2308
 256 |  4096 |  3076 |  2568 |  2316 |   6144 |   5634 |   5380 |   5380
 512 |  9216 |  7172 |  6152 |  5644 |  13824 |  12802 |  12292 |  12292
1024 | 20480 | 16388 | 14344 | 13324 |  30720 |  28674 |  27652 |  27652
2048 | 45056 | 36868 | 32776 | 30732 |  67584 |  63490 |  61444 |  61444
4096 | 98304 | 81924 | 73736 | 69644 | 147456 | 139266 | 135172 | 135172
Radix-4
   4 |    12 |     0 |     0 |     0 |     22 |     16 |     16 |     16
  16 |    96 |    36 |    28 |    24 |    176 |    146 |    144 |    144
  64 |   576 |   324 |   284 |   264 |   1056 |    930 |    920 |    920
 256 |  3072 |  2052 |  1884 |  1800 |   5632 |   5122 |   5080 |   5080
1024 | 15360 | 11268 | 10588 | 10248 |  28160 |  26114 |  25944 |  25944
4096 | 73728 | 57348 | 54620 | 53256 | 135168 | 126978 | 126296 | 126296
Radix-8
   8 |    32 |     4 |     4 |     4 |     66 |     52 |     52 |     52
  64 |   512 |   260 |   252 |   248 |   1056 |    930 |    928 |    928
 512 |  6144 |  4100 |  4028 |  3992 |  12672 |  11650 |  11632 |  11632
4096 | 65536 | 49156 | 48572 | 48280 | 135168 | 126978 | 126832 | 126832
Radix-16
  16 |    80 |    20 |    20 |    20 |    178 |    148 |    148 |    148
 256 |  2560 |  1540 |  1532 |  1528 |   5696 |   5186 |   5184 |   5184
4096 | 61440 | 45060 | 44924 | 44856 | 136704 | 128514 | 128480 | 128480
Split-Radix
   2 |     0 |     0 |     0 |     0 |      4 |      4 |      4 |      4
   4 |     8 |     0 |     0 |     0 |     20 |     16 |     16 |     16
   8 |    24 |     8 |     4 |     4 |     60 |     52 |     52 |     52
  16 |    72 |    32 |    28 |    24 |    164 |    144 |    144 |    144
  32 |   184 |   104 |    92 |    84 |    412 |    372 |    372 |    372
  64 |   456 |   288 |   268 |   248 |    996 |    912 |    912 |    912
 128 |  1080 |   744 |   700 |   660 |   2332 |   2164 |   2164 |   2164
 256 |  2504 |  1824 |  1740 |  1656 |   5348 |   5008 |   5008 |   5008
 512 |  5688 |  4328 |  4156 |  3988 |  12060 |  11380 |  11380 |  11380
1024 | 12744 | 10016 |  9676 |  9336 |  26852 |  25488 |  25488 |  25488
2048 | 28216 | 22760 | 22076 | 21396 |  59164 |  56436 |  56436 |  56436
4096 | 61896 | 50976 | 49612 | 48248 | 129252 | 123792 | 123792 | 123792

Table 9.1: Number of Real Multiplications and Additions for Complex Single Radix FFTs


In Table 9.1, Mi and Ai refer to the number of real multiplications and real additions used
by an FFT with i separately written butterflies. The first block has the counts for Radix-2,
the second for Radix-4, the third for Radix-8, the fourth for Radix-16, and the last for
the Split-Radix FFT. For the split-radix FFT, M3 and A3 refer to the two-butterfly-plus
program and M5 and A5 refer to the three-butterfly program.
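
The radix-2 entries for the one- and two-butterfly programs can be reproduced by simple
counting, which is a useful sanity check on the table. The Python sketch below rests on the
following reasoning, stated here as an assumption about how the counts arise rather than
taken from the references: a length-$2^M$ radix-2 FFT has $(N/2)M$ butterflies, each general
butterfly costs four real additions plus one complex multiplication (four real
multiplications and two real additions), and exactly $N - 1$ butterflies have a twiddle
factor of unity. The function name `radix2_counts` is hypothetical.

```python
def radix2_counts(m):
    """Real operation counts for a length-2**m radix-2 FFT (4-mult, 2-add)."""
    n = 2 ** m
    butterflies = (n // 2) * m
    trivial = n - 1              # butterflies whose twiddle factor is W^0 = 1
    return {
        "N": n,
        "M1": 4 * butterflies,               # every twiddle multiplied out
        "A1": 6 * butterflies,
        "M2": 4 * (butterflies - trivial),   # unity multiplications removed
        "A2": 6 * butterflies - 2 * trivial,
    }


for m in (3, 10, 12):
    print(radix2_counts(m))
```

For $N = 8$ this gives M1 = 48, A1 = 72, M2 = 20, and A2 = 58, in agreement with the
Radix-2 block of the table.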


The first evaluations of FFT algorithms were in terms of the number of real multiplications 
required as that was the slowest operation on the computer and, therefore, controlled the 
execution speed. Later with hardware arithmetic both the number of multiplications and 
additions became important. Modern systems have arithmetic speeds such that indexing 
and data transfer times become important factors. Morris [249] has looked at some of 
these problems and has developed a procedure called autogen to write partially straight-line 
program code to significantly reduce overhead and speed up FFT run times. Some hardware,
such as the TMS320 signal processing chip, has the multiply and add operations combined.
Some machines have vector instructions or have parallel processors. Because the execution 
speed of an FFT depends not only on the algorithm, but also on the hardware architecture 
and compiler, experiments must be run on the system to be used. 


In many cases the unscrambler or bit-reverse-counter requires 10% of the execution time;
therefore, if possible, it should be eliminated. In high-speed convolution where the
convolution is done by multiplication of DFT’s, a decimation-in-frequency FFT can be combined
with a decimation-in-time inverse FFT to require no unscrambler. It is also possible for a
radix-2 FFT to do the unscrambling inside the FFT but the structure is not very regular 
[299], [193]. Special structures can be found in [299] and programs for data that are real or 
have special symmetries are in [82], [102], [357]. 
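
The convolution case can be sketched in Python as follows. This is a minimal illustration,
assuming power-of-two lengths and using complex lists for brevity: the forward transforms are
decimation-in-frequency and leave their outputs in bit-reversed order, the two spectra are
multiplied point by point in that scrambled order, and a decimation-in-time inverse that
expects bit-reversed input returns the result in natural order. The function names are
assumptions made for this sketch.

```python
import cmath


def dif_fft(a, sign=-1):
    """In-place radix-2 DIF: natural-order input, bit-reversed output."""
    n = len(a)
    span = n
    while span > 1:
        half = span // 2
        for j in range(half):
            wj = cmath.exp(sign * 2j * cmath.pi * j / span)
            for i in range(j, n, span):
                t = a[i] - a[i + half]
                a[i] += a[i + half]
                a[i + half] = t * wj
        span = half


def dit_fft(a, sign=-1):
    """In-place radix-2 DIT: bit-reversed input, natural-order output."""
    n = len(a)
    span = 2
    while span <= n:
        half = span // 2
        for j in range(half):
            wj = cmath.exp(sign * 2j * cmath.pi * j / span)
            for i in range(j, n, span):
                t = a[i + half] * wj
                a[i + half] = a[i] - t
                a[i] += t
        span *= 2


def circular_convolve(x, h):
    """Length-N cyclic convolution with no unscrambling step."""
    n = len(x)
    fx = [complex(v) for v in x]
    fh = [complex(v) for v in h]
    dif_fft(fx)
    dif_fft(fh)
    # Both spectra are scrambled identically, so they can be multiplied as is.
    prod = [p * q for p, q in zip(fx, fh)]
    dit_fft(prod, sign=+1)  # inverse transform, up to the 1/N scale factor
    return [v / n for v in prod]
```

Because both forward transforms scramble their outputs in the same way, the pointwise
product never needs to be put in natural order.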


Although there can be significant differences in the efficiencies of the various Cooley-Tukey 
and Split-Radix FFTs, the number of multiplications and additions for all of them is on the 
order of $N \log N$. That is fundamental to the class of algorithms.


9.4 The Quick Fourier Transform, An FFT based on Symmetries


The development of fast algorithms usually consists of using special properties of the algo- 
rithm of interest to remove redundant or unnecessary operations of a direct implementation. 
The discrete Fourier transform (DFT) defined by 


$$C(k) = \sum_{n=0}^{N-1} x(n) \, W_N^{nk} \tag{9.16}$$

where

$$W_N = e^{-j2\pi/N} \tag{9.17}$$


has enormous capacity for improvement of its arithmetic efficiency. Most fast algorithms use 
the periodic and symmetric properties of its basis functions. The classical Cooley-Tukey FFT 
and prime factor FFT [64] exploit the periodic properties of the cosine and sine functions. 
Their use of the periodicities to share and, therefore, reduce arithmetic operations depends 
on the factorability of the length of the data to be transformed. For highly composite lengths,
the number of floating-point operations is of order $N \log(N)$ and for prime lengths it is of
order $N^2$.


This section will look at an approach using the symmetric properties to remove redundancies. 
This possibility has long been recognized [176], [211], [344], [270] but has not been developed 
in any systematic way in the open literature. We will develop an algorithm, called the quick 
Fourier transform (QFT) [211], that will reduce the number of floating point operations
necessary to compute the DFT by a factor of two to four over direct methods or Goertzel’s
method for prime lengths. Indeed, it seems the best general algorithm available for prime 
length DFTs. One can always do better by using Winograd type algorithms but they must 
be individually designed for each length. The Chirp Z-transform can be used for longer 
lengths. 


9.4.1 Input and Output Symmetries 


We use the fact that the cosine is an even function and the sine is an odd function. The 
kernel of the DFT or the basis functions of the expansion is given by 


$$W_N^{nk} = e^{-j2\pi nk/N} = \cos(2\pi nk/N) - j \sin(2\pi nk/N) \tag{9.18}$$


which has an even real part and odd imaginary part. If the data x(n) are decomposed into 
their real and imaginary parts and those into their even and odd parts, we have 


$$x(n) = u(n) + j v(n) = [u_e(n) + u_o(n)] + j [v_e(n) + v_o(n)] \tag{9.19}$$

where the even part of the real part of $x(n)$ is given by

$$u_e(n) = (u(n) + u(-n)) / 2 \tag{9.20}$$

and the odd part of the real part is

$$u_o(n) = (u(n) - u(-n)) / 2 \tag{9.21}$$

with corresponding definitions of $v_e(n)$ and $v_o(n)$. Using (9.19) with a simpler notation,
the DFT of (9.16) becomes

$$C(k) = \sum_{n=0}^{N-1} (u + jv)(\cos - j \sin). \tag{9.22}$$
The sum over an integral number of periods of an odd function is zero and the sum of an 
even function over half of the period is one half the sum over the whole period. This causes 
(9.16) and (9.22) to become 


$$C(k) = \sum_{n=0}^{N/2-1} [u_e \cos + v_o \sin] + j [v_e \cos - u_o \sin]. \tag{9.23}$$

for $k = 0, 1, 2, \cdots, N-1$.


The evaluation of the DFT using equation (9.23) requires half as many real multiplications
and half as many real additions as evaluating it using (9.16) or (9.22). We have exploited
the symmetries of the sine and cosine as functions of the time index $n$. This is independent
of whether the length is composite or not. Another view of this formulation is that we have
used the distributive property of multiplication over addition. In other words, rather
than multiply two data points by the same value of a sine or cosine then add the results, one
should add the data points first then multiply the sum by the sine or cosine which requires
one rather than two multiplications.


Next we take advantage of the symmetries of the sine and cosine as functions of the frequency 
index k. Using these symmetries on (9.23) gives 


$$C(k) = \sum_{n=0}^{N/2-1} [u_e \cos + v_o \sin] + j [v_e \cos - u_o \sin] \tag{9.24}$$

$$C(N-k) = \sum_{n=0}^{N/2-1} [u_e \cos - v_o \sin] + j [v_e \cos + u_o \sin]. \tag{9.25}$$

for $k = 0, 1, 2, \cdots, N/2 - 1$. This again reduces the number of operations by a factor of
two, this time because it calculates two output values at a time. The first reduction by a
factor of two is always available. The second is possible only if both DFT values are needed.
It is not available if you are calculating only one DFT value. The above development has
not dealt with the details that arise with the difference between an even and an odd length.
That is straightforward.
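
A compact Python sketch of the two reductions is given below. It is an illustration written
for this discussion, not the Fortran program mentioned elsewhere in the book, and the name
`qft` and the way the end points are handled are assumptions. To keep the arithmetic exact
for any length, the sketch forms the sums and differences $u(n) \pm u(N-n)$ without the
division by two in (9.20) and (9.21), and treats the terms $n = 0$ and, for even $N$,
$n = N/2$ separately because they have no mirror partner.

```python
import math


def qft(x):
    """Order-N^2 DFT using the even/odd symmetries in both n and k."""
    n_len = len(x)
    u = [z.real for z in x]
    v = [z.imag for z in x]
    half = (n_len - 1) // 2
    # Pre-additions: twice the even and odd parts for n = 1 .. half.
    ue = [u[n] + u[n_len - n] for n in range(1, half + 1)]
    uo = [u[n] - u[n_len - n] for n in range(1, half + 1)]
    ve = [v[n] + v[n_len - n] for n in range(1, half + 1)]
    vo = [v[n] - v[n_len - n] for n in range(1, half + 1)]
    out = [0j] * n_len
    for k in range(n_len // 2 + 1):
        a, c = u[0], v[0]                      # n = 0 term
        if n_len % 2 == 0:                     # n = N/2 term, cos(pi*k) = +/-1
            sign = -1.0 if k % 2 else 1.0
            a += sign * u[n_len // 2]
            c += sign * v[n_len // 2]
        b = d = 0.0
        for i in range(half):
            theta = 2.0 * math.pi * (i + 1) * k / n_len
            cs, sn = math.cos(theta), math.sin(theta)
            a += ue[i] * cs    # sum of u*cos
            b += vo[i] * sn    # sum of v*sin
            c += ve[i] * cs    # sum of v*cos
            d += uo[i] * sn    # sum of u*sin
        out[k] = complex(a + b, c - d)                    # C(k), as in (9.24)
        out[(n_len - k) % n_len] = complex(a - b, c + d)  # C(N-k), as in (9.25)
    return out
```

Each pass of the inner loop uses four real multiplications and serves two outputs, which is
where the factor of four over the direct method comes from. The sketch works for odd and
even lengths, including primes such as $N = 7$.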


9.4.2 Further Reductions if the Length is Even 


If the length of the sequence to be transformed is even, there are further symmetries that
can be exploited. There will be four data values that are all multiplied by plus or minus the
same sine or cosine value. This means a more complicated pre-addition process which is a
generalization of the simple calculation of the even and odd parts in (9.20) and (9.21) will
reduce the size of the order $N^2$ part of the algorithm by still another factor of two or
four. If the length is divisible by 4, the process can be repeated. Indeed, if the length is a
power of 2, one can show this process is equivalent to calculating the DFT in terms of
discrete cosine and sine transforms [156], [159] with a resulting arithmetic complexity of
order $N \log(N)$ and with a structure that is well suited to real data calculations and
pruning.


If the flow-graph of the Cooley-Tukey FFT is compared to the flow-graph of the QFT, one
notices both similarities and differences. Both progress in stages as the length is continually
divided by two. The Cooley-Tukey algorithm uses the periodic properties of the sine and
cosine to give the familiar horizontal tree of butterflies. The parallel diagonal lines in this
graph represent the parallel stepping through the data in synchronism with the periodic basis
functions. The QFT has diagonal lines that connect the first data point with the last, then
the second with the next to last, and so on to give a “star”-like picture. This is interesting
in that one can look at the flow graph of an algorithm developed by some completely different
strategy and often find sections with the parallel structures and other parts with the star
structure. These must be using some underlying periodic and symmetric properties of the
basis functions.

9.4.3 Arithmetic Complexity and Timings


A careful analysis of the QFT shows that $2N$ additions are necessary to compute the even and
odd parts of the input data. This is followed by the length-$N/2$ inner product that requires
$4(N/2)^2 = N^2$ real multiplications and an equal number of additions. This is followed by the
calculations necessary for the simultaneous calculations of the first half and last half of
$C(k)$ which requires $4(N/2) = 2N$ real additions. This means the total QFT algorithm requires
$N^2$ real multiplications and $N^2 + 4N$ real additions. These numbers along with those for
the Goertzel algorithm [52], [64], [270] and the direct calculation of the DFT are included
in the following table. Of the various order-$N^2$ DFT algorithms, the QFT seems to be the
most efficient general method for an arbitrary length $N$.

Algorithm               | Real Mults. | Real Adds  | Trig Eval.
------------------------+-------------+------------+-----------
Direct DFT              | 4N^2        | 4N^2       | 2N^2
Mod. 2nd Order Goertzel | N^2 + N     | 2N^2 + N   | N
QFT                     | N^2         | N^2 + 4N   | 2N

Table 9.2

Timings of the algorithms on a PC in milliseconds are given in the following table.

Algorithm               | N = 125 | N = 256
------------------------+---------+--------
Direct DFT              | 4.90    | 19.83
Mod. 2nd Order Goertzel | 1.32    | 5.55
QFT                     | 1.09    | 4.50
Chirp + FFT             | 1.70    | 3.52

Table 9.3

These timings track the floating point operation counts fairly well.

9.4.4 Conclusions


The QFT is a straightforward DFT algorithm that uses all of the possible symmetries of
the DFT basis function with no requirements on the length being composite. These ideas 
have been proposed before, but have not been published or clearly developed by [211], [344], 
[342], [168]. It seems that the basic QFT is practical and useful as a general algorithm 
for lengths up to a hundred or so. Above that, the chirp z-transform [64] or other filter 
based methods will be superior. For special cases and shorter lengths, methods based on 
Winograd’s theories will always be superior. Nevertheless, the QFT has a definite place in 
the array of DFT algorithms and is not well known. A Fortran program is included in the 
appendix. 


It is possible, but unlikely, that further arithmetic reduction could be achieved using the
fact that $W_N$ has unity magnitude as was done in the second-order Goertzel algorithm. It is
also possible that some way of combining the Goertzel and QFT algorithm would have some
advantages. A development of a complete QFT decomposition of a DFT of length-$2^M$ shows
interesting structure [156], [159] and arithmetic complexity comparable to average
Cooley-Tukey FFTs. It does seem better suited to real data calculations with pruning.

Chapter 10


The Prime Factor and Winograd Fourier
Transform Algorithms¹


The prime factor algorithm (PFA) and the Winograd Fourier transform algorithm (WFTA) 
are methods for efficiently calculating the DFT which use, and in fact, depend on the Type-1 
index map from Multidimensional Index Mapping: Equation 10 (3.10) and Multidimensional 
Index Mapping: Equation 6 (3.6). The use of this index map preceded Cooley and Tukey’s 
paper [150], [302] but its full potential was not realized until it was combined with Winograd’s 
short DFT algorithms. The modern PFA was first presented in [213] and a program given 
in [57]. The WFTA was first presented in [407] and programs given in [236], [83]. 


The number theoretic basis for the indexing in these algorithms may, at first, seem more 
complicated than in the Cooley-Tukey FFT; however, if approached from the general index 
mapping point of view of Multidimensional Index Mapping (Chapter 3), it is straightfor- 
ward, and part of a common approach to breaking large problems into smaller ones. The 
development in this section will parallel that in The Cooley-Tukey Fast Fourier Transform 
Algorithm (Chapter 9). 


The general index maps of Multidimensional Index Mapping: Equation 6 (3.6) and Mul- 
tidimensional Index Mapping: Equation 12 (3.12) must satisfy the Type-1 conditions of 
Multidimensional Index Mapping: Equation 7 (3.7) and Multidimensional Index Mapping: 
Equation 10 (3.10) which are 


$$K_1 = aN_2 \ \text{and} \ K_2 = bN_1 \ \text{with} \ (K_1, N_1) = (K_2, N_2) = 1 \tag{10.1}$$

$$K_3 = cN_2 \ \text{and} \ K_4 = dN_1 \ \text{with} \ (K_3, N_1) = (K_4, N_2) = 1 \tag{10.2}$$


The row and column calculations in Multidimensional Index Mapping: Equation 15 (3.15)
are uncoupled by Multidimensional Index Mapping: Equation 16 (3.16) which for this case
are

$$((K_1 K_4))_N = ((K_2 K_3))_N = 0 \tag{10.3}$$

¹This content is available online at <http://cnx.org/content/m16335/1.9/>.


In addition, to make each short sum a DFT, the $K_i$ must also satisfy

$$((K_1 K_3))_N = N_2 \ \text{and} \ ((K_2 K_4))_N = N_1 \tag{10.4}$$

In order to have the smallest values for $K_i$, the constants in (10.1) are chosen to be

$$a = b = 1, \quad c = ((N_2^{-1}))_{N_1}, \quad d = ((N_1^{-1}))_{N_2} \tag{10.5}$$

which gives for the index maps in (10.1)

$$n = ((N_2 n_1 + N_1 n_2))_N \tag{10.6}$$

$$k = ((K_3 k_1 + K_4 k_2))_N \tag{10.7}$$


The frequency index map is a form of the Chinese remainder theorem. Using these index
maps, the DFT in Multidimensional Index Mapping: Equation 15 (3.15) becomes

$$ \hat{X}(k_1, k_2) = \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1} \hat{x}(n_1, n_2) \, W_{N_1}^{n_1 k_1} \, W_{N_2}^{n_2 k_2} $$    (10.8)

which is a pure two-dimensional DFT with no twiddle factors and the summations can be
done in either order. Choices other than (10.5) could be used. For example, a = b = c =
d = 1 will cause the input and output index map to be the same and, therefore, there will be
no scrambling of the output order. The short summations in (10.8), however, will no longer
be short DFT’s [57].
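As an illustration of the index maps (10.6) and (10.7), the following Python sketch builds the constants of (10.5) for two relatively prime factors and evaluates (10.8) literally. The function names are hypothetical and chosen only for this example, the short sums are computed by brute force rather than by Winograd modules, and the modular inverse relies on the three-argument `pow` of Python 3.8 or later.

```python
import cmath

def type1_constants(N1, N2):
    """Index-map constants K1..K4 for relatively prime N1, N2 with a = b = 1."""
    K1, K2 = N2, N1                  # (10.1) with a = b = 1
    K3 = pow(N2, -1, N1) * N2        # c = inverse of N2 modulo N1, as in (10.5)
    K4 = pow(N1, -1, N2) * N1        # d = inverse of N1 modulo N2, as in (10.5)
    return K1, K2, K3, K4

def direct_dft(x):
    """Reference length-N DFT by direct summation."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def two_factor_pfa(x, N1, N2):
    """Evaluate (10.8): a two-dimensional DFT with no twiddle factors."""
    N = N1 * N2
    K1, K2, K3, K4 = type1_constants(N1, N2)
    X = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            acc = 0j
            for n2 in range(N2):
                for n1 in range(N1):
                    n = (K1 * n1 + K2 * n2) % N            # input map (10.6)
                    acc += (x[n]
                            * cmath.exp(-2j * cmath.pi * n1 * k1 / N1)
                            * cmath.exp(-2j * cmath.pi * n2 * k2 / N2))
            X[(K3 * k1 + K4 * k2) % N] = acc               # output map (10.7)
    return X
```

For N1 = 3 and N2 = 5 the constants come out as K1 = 5, K2 = 3, K3 = 10 and K4 = 6, and the result of `two_factor_pfa` agrees with the direct DFT up to rounding.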


An important feature of the short Winograd DFT’s described in Winograd’s Short DFT
Algorithms (Chapter 7) that is useful for both the PFA and WFTA is the fact that the multiplier
constants in Winograd’s Short DFT Algorithms: Equation 6 (7.6) or Winograd’s Short DFT
Algorithms: Equation 8 (7.8) are either real or imaginary, never a general complex number.
For that reason, multiplication by complex data requires only two real multiplications, not
four. That is a very significant feature. It is also true that the $j$ multiplier can be commuted
from the $D$ operator to the last part of the $A^T$ operator. This means the $D$ operator has
only real multipliers and the calculations on real data remain real until the last stage. This
can be seen by examining the short DFT modules in [65], [198] and in the appendices.


10.1 The Prime Factor Algorithm 

If the DFT is calculated directly using (10.8), the algorithm is called a prime factor algorithm
[150], [302] and was discussed in Winograd’s Short DFT Algorithms (Chapter 7) and
Multidimensional Index Mapping: In-Place Calculation of the DFT and Scrambling (Section 3.2).
When the short DFT’s are calculated
by the very efficient algorithms of Winograd discussed in Factoring the Signal Processing
Operators (Chapter 6), the PFA becomes a very powerful method that is as fast or faster
than the best Cooley-Tukey FFT’s [57], [213].


A flow graph is not as helpful with the PFA as it was with the Cooley-Tukey FFT; however,
the representation in Figure 10.1, which combines Multidimensional Index Mapping:
Figure 1 (Figure 3.1) and Winograd’s Short DFT Algorithms: Figure 2 (Figure 7.2), gives a
good picture of the algorithm with the example of Multidimensional Index Mapping:
Equation 25 (3.25).








Figure 10.1: A Prime Factor FFT for N = 15 





If N is factored into three factors, the DFT of (10.8) would have three nested summations
and would be a three-dimensional DFT. This principle extends to any number of factors;
however, recall that the Type-1 map requires that all the factors be relatively prime. A very
simple three-loop indexing scheme has been developed [57] which gives a compact, efficient
PFA program for any number of factors. The basic program structure is illustrated in
Listing 10.1 with the short DFT’s being omitted for clarity. Complete programs are given in
[65] and in the appendices.

      DO 10 K = 1, M
         N1 = NI(K)
         N2 = N/N1
         I(1) = 1
         DO 20 J = 1, N2
            DO 30 L = 2, N1
               I(L) = I(L-1) + N2
               IF (I(L).GT.N) I(L) = I(L) - N
   30       CONTINUE
            GOTO (20,102,103,104,105), N1
            I(1) = I(1) + N1
   20    CONTINUE
   10 CONTINUE
      RETURN
C----------------MODULE FOR N = 2-----------------
  102 R1      = X(I(1))
      X(I(1)) = R1 + X(I(2))
      X(I(2)) = R1 - X(I(2))
      R1      = Y(I(1))
      Y(I(1)) = R1 + Y(I(2))
      Y(I(2)) = R1 - Y(I(2))
      GOTO 20
C----------------OTHER MODULES--------------------
  103 Length-3 DFT
  104 Length-4 DFT
  105 Length-5 DFT
      etc.


Listing 10.1: Part of a FORTRAN PFA Program 





As in the Cooley-Tukey program, the DO 10 loop steps through the M stages (factors of
N) and the DO 20 loop calculates the N/N1 length-N1 DFT’s. The input index map of
(10.6) is implemented in the DO 30 loop and the statement just before label 20. In the PFA,
each stage or factor requires a separately programmed module or butterfly. This lengthens
the PFA program, but an efficient Cooley-Tukey program will also require three or more
butterflies.


Because the PFA is calculated in-place using the input index map, the output is scrambled.
There are five approaches to dealing with this scrambled output. First, there are some
applications where the output does not have to be unscrambled as in the case of high-speed
convolution. Second, an unscrambler can be added after the PFA to give the output in
correct order just as the bit-reversed-counter is used for the Cooley-Tukey FFT. A simple
unscrambler is given in [65], [57] but it is not in place. The third method does the
unscrambling in the modules while they are being calculated. This is probably the fastest method
but the program must be written for a specific length [65], [57]. A fourth method is similar
and achieves the unscrambling by choosing the multiplier constants in the modules properly
[198]. The fifth method uses a separate indexing method for the input and output of each
module [65], [320].
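The three-loop indexing and the second unscrambling approach can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program of [65] or [57]: indices are zero-based, each short DFT is computed by direct summation in place of a Winograd module, the factors are assumed to be pairwise relatively prime with product equal to the data length, and the names `pfa_in_place` and `unscramble` are hypothetical.

```python
import cmath
from itertools import product

def direct_dft(v):
    """Direct DFT, used here both as the short module and as a reference."""
    n = len(v)
    return [sum(v[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def pfa_in_place(x, factors):
    """In-place PFA using the input index map at every stage (output scrambled)."""
    N = len(x)
    for N1 in factors:                       # one stage per factor (DO 10)
        N2 = N // N1
        for j in range(N2):                  # N/N1 short DFTs (DO 20)
            # indices of one length-N1 DFT, stepping by N2 modulo N (DO 30)
            idx = [(j * N1 + l * N2) % N for l in range(N1)]
            out = direct_dft([x[i] for i in idx])
            for i, v in zip(idx, out):
                x[i] = v
    return x

def unscramble(y, factors):
    """Out-of-place unscrambler: move each value to its true frequency index."""
    N = len(y)
    X = [0j] * N
    for ks in product(*(range(f) for f in factors)):
        # where the in-place algorithm left the value for (k_1, ..., k_M)
        stored = sum((N // f) * k for f, k in zip(factors, ks)) % N
        # Chinese remainder theorem output map, the multi-factor form of (10.7)
        true_k = sum((N // f) * pow(N // f, -1, f) * k
                     for f, k in zip(factors, ks)) % N
        X[true_k] = y[stored]
    return X
```

A typical call is `unscramble(pfa_in_place(list(x), [3, 4, 5]), [3, 4, 5])` for a length-60 input; applications such as high-speed convolution could skip the second step.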


10.2 The Winograd Fourier Transform Algorithm 


The Winograd Fourier transform algorithm (WFTA) uses a very powerful property of the
Type-1 index map and the DFT to give a further reduction of the number of multiplications
in the PFA. Using an operator notation where $F_1$ represents taking row DFT’s and $F_2$
represents column DFT’s, the two-factor PFA of (10.8) is represented by

$$ X = F_2 F_1 x $$    (10.9)


It has been shown [410], [190] that if each operator represents identical operations on each
row or column, they commute. Since $F_1$ and $F_2$ represent length $N_1$ and $N_2$ DFT’s, they
commute and (10.9) can also be written

$$ X = F_1 F_2 x $$    (10.10)


If each short DFT in $F$ is expressed by three operators as in Winograd’s Short DFT
Algorithms: Equation 8 (7.8) and Winograd’s Short DFT Algorithms: Figure 2 (Figure 7.2),
$F$ can be factored as

$$ F = A^T D A $$    (10.11)


where $A$ represents the set of additions done on each row or column that performs the
residue reduction as Winograd’s Short DFT Algorithms: Equation 30 (7.30). Because of
the appearance of the flow graph of $A$ and because it is the first operator on $x$, it is called
a preweave operator [236]. $D$ is the set of $M$ multiplications and $A^T$ (or $B^T$ or $C^T$) from
Winograd’s Short DFT Algorithms: Equation 5 (7.5) or Winograd’s Short DFT Algorithms:
Equation 6 (7.6) is the reconstruction operator called the postweave. Applying (10.11) to
(10.9) gives

$$ X = A_2^T D_2 A_2 A_1^T D_1 A_1 x $$    (10.12)

This is the PFA of (10.8) and Figure 10.1 where $A_1^T D_1 A_1$ represents the row DFT’s on the
array formed from $x$. Because these operators commute, (10.12) can also be written as

$$ X = A_2^T A_1^T D_2 A_2 D_1 A_1 x $$    (10.13)

or

$$ X = A_2^T A_1^T D_2 D_1 A_2 A_1 x $$    (10.14)

but the two adjacent multiplication operators can be premultiplied and the result represented
by one operator $D = D_2 D_1$ which is no longer the same for each row or column. Equation
(10.14) becomes

$$ X = A_2^T A_1^T D A_2 A_1 x $$    (10.15)


This is the basic idea of the Winograd Fourier transform algorithm. The commuting of 
the multiplication operators together in the center of the algorithm is called nesting and it 
results in a significant decrease in the number of multiplications that must be done at the 
execution of the algorithm. Pictorially, the PFA of Figure 10.1 becomes [213] the WFTA in 
Figure 10.2. 
















Figure 10.2: A Length-15 WFTA with Nested Multiplications





The rectangular structure of the preweave addition operators causes an expansion of the data 
in the center of the algorithm. The 15 data points in Figure 10.2 become 18 intermediate 
values. This expansion is a major problem in programming the WFTA because it prevents 
a straightforward in-place calculation and causes an increase in the number of required 
additions and in the number of multiplier constants that must be precalculated and stored. 


From Figure 10.2 and the idea of premultiplying the individual multiplication operators, it
can be seen why the multiplications by unity had to be considered in Winograd’s Short DFT
Algorithms: Table 1 (Table 7.1). Even if a multiplier in $D_1$ is unity, it may not be in $D_2 D_1$.
In Figure 10.2 with factors of three and five, there appear to be 18 multiplications required
because of the expansion of the length-5 preweave operator, $A_2$; however, one of the multipliers
in each of the length three and five operators is unity, so one of the 18 multipliers in the
product is unity. This gives 17 required multiplications, a rather impressive reduction from
the $15^2 = 225$ multiplications required by direct calculation. This number of 17 complex
multiplications will require only 34 real multiplications because, as mentioned earlier, the
multiplier constants are purely real or imaginary while the 225 complex multiplications are
general and therefore will require four times as many real multiplications.
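The nesting idea can be checked numerically with a small Python sketch of a length-15 WFTA. The preweave, multiplier and postweave matrices below are one possible three-multiply length-3 and six-multiply length-5 factorization of the form $F = A^T D A$, written out for this example; they are assumptions of the sketch and are not copied from the modules of Chapter 7 or the appendices. Writing the data as a 3 by 5 array turns the row and column operators into left and right matrix products.

```python
import cmath
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def direct_dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

# Length-3 module: F3 = POST3 * diag(MUL3) * PRE3 (3 multipliers, one is unity)
PRE3 = [[1, 1, 1], [0, 1, 1], [0, 1, -1]]
MUL3 = [1, math.cos(2 * math.pi / 3) - 1, -1j * math.sin(2 * math.pi / 3)]
POST3 = [[1, 0, 0], [1, 1, 1], [1, 1, -1]]

# Length-5 module: F5 = POST5 * diag(MUL5) * PRE5 (6 multipliers, one is unity)
u = 2 * math.pi / 5
c1, c2, s1, s2 = math.cos(u), math.cos(2 * u), math.sin(u), math.sin(2 * u)
PRE5 = [[1, 1, 1, 1, 1], [0, 1, 1, 1, 1], [0, 1, -1, -1, 1],
        [0, 1, 0, 0, -1], [0, 1, -1, 1, -1], [0, 0, -1, 1, 0]]
MUL5 = [1, (c1 + c2) / 2 - 1, (c1 - c2) / 2,
        -1j * (s1 + s2), -1j * s2, -1j * (s1 - s2)]
POST5 = [[1, 0, 0, 0, 0, 0], [1, 1, 1, 1, -1, 0], [1, 1, -1, 0, 1, 1],
         [1, 1, -1, 0, -1, -1], [1, 1, 1, -1, 1, 0]]

# Nested multiplier array D = D2 D1: 3 x 6 = 18 premultiplied constants
D = [[complex(a) * complex(b) for b in MUL5] for a in MUL3]

def wfta15(x):
    N1, N2, N = 3, 5, 15
    # input index map (10.6)
    arr = [[x[(N2 * n1 + N1 * n2) % N] for n2 in range(N2)] for n1 in range(N1)]
    pre = matmul(matmul(PRE3, arr), transpose(PRE5))     # all preweave additions
    mid = [[pre[i][j] * D[i][j] for j in range(6)] for i in range(3)]  # 18 values
    out = matmul(matmul(POST3, mid), transpose(POST5))   # all postweave additions
    X = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            X[(10 * k1 + 6 * k2) % N] = out[k1][k2]      # output map (10.7)
    return X
```

In this sketch the 15 inputs expand to 18 intermediate values, exactly one entry of `D` equals unity, and every entry is purely real or purely imaginary, which is consistent with the counts of 17 complex and 34 real multiplications discussed above.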


The number of additions depends on the order of the pre- and postweave operators. For 
example in the length-15 WFTA in Figure 10.2, if the length-5 had been done first and last, 
there would have been six row addition preweaves in the preweave operator rather than the 
five shown. It is difficult to illustrate the algorithm for three or more factors of N, but the 
ideas apply to any number of factors. Each length has an optimal ordering of the pre- and 
postweave operators that will minimize the number of additions. 


A program for the WFTA is not as simple as for the FFT or PFA because of the very
characteristic that reduces the number of multiplications, the nesting. A simple two-factor
example program is given in [65] and a general program can be found in [236], [83]. The
same lengths are possible with the PFA and WFTA and the same short DFT modules can be
used; however, the multiplies in the modules must occur in one place for use in the WFTA.


10.3 Modifications of the PFA and WFTA Type Algorithms


In the previous section it was seen how using the permutation property of the elementary 
operators in the PFA allowed the nesting of the multiplications to reduce their number. It 
was also seen that a proper ordering of the operators could minimize the number of additions. 
These ideas have been extended in formulating a more general algorithm optimizing problem. 
If the DFT operator F in (10.11) is expressed in a still more factored form obtained from 
Winograd’s Short DFT Algorithms: Equation 30 (7.30), a greater variety of ordering can be 
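optimized. For example, if the $A$ operators have two factors

$$ F_1 = A_1^T A_1^{\prime T} D_1 A_1^{\prime} A_1 $$    (10.16)

the DFT in (10.10) becomes

$$ X = A_2^T A_2^{\prime T} D_2 A_2^{\prime} A_2 A_1^T A_1^{\prime T} D_1 A_1^{\prime} A_1 x $$    (10.17)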


The operator notation is very helpful in understanding the central ideas, but may hide some
important facts. It has been shown [410], [198] that operators in different $F_i$ commute with
each other, but the order of the operators within an $F_i$ cannot be changed. They represent
the matrix multiplications in Winograd’s Short DFT Algorithms: Equation 30 (7.30) or
Winograd’s Short DFT Algorithms: Equation 8 (7.8) which do not commute.

This formulation allows a very large set of possible orderings; in fact, the number is so
large that some automatic technique must be used to find the “best”. It is possible to set
up a criterion of optimality that not only includes the number of multiplications but the 
number of additions as well. The effects of relative multiply-add times, data transfer times, 
CPU register and memory sizes, and other hardware characteristics can be included in the 
criterion. Dynamic programming can then be applied to derive an optimal algorithm for 
a particular application [190]. This is a very interesting idea as there is no longer a single 
algorithm, but a class and an optimizing procedure. The challenge is to generate a broad 
enough class to result in a solution that is close to a global optimum and to have a practical 
scheme for finding the solution. 


Results obtained applying the dynamic programming method to the design of fairly long 
DFT algorithms gave algorithms that had fewer multiplications and additions than either 
a pure PFA or WFTA [190]. It seems that some nesting is desirable but not total nesting 
for four or more factors. There are also some interesting possibilities in mixing the Cooley- 
Tukey with this formulation. Unfortunately, the twiddle factors are not the same for all rows 
and columns, therefore, operations cannot commute past a twiddle factor operator. There 
are ways of breaking the total algorithm into horizontal paths and using different orderings 
along the different paths [264], [198]. In a sense, this is what the split-radix FFT does with 
its twiddle factors when compared to a conventional Cooley-Tukey FFT. 


There are other modifications of the basic structure of the Type-1 index map DFT algorithm. 
One is to use the same index structure and conversion of the short DFT’s to convolution 
as the PFA but to use some other method for the high-speed convolution. Table look-up of 
partial products based on distributed arithmetic to eliminate all multiplications [78] looks 
promising for certain very specific applications, perhaps for specialized VLSI implementation. 
Another possibility is to calculate the short convolutions using number-theoretic transforms 
[30], [236], [264]. This would also require special hardware. Direct calculation of short
convolutions is faster on certain pipelined processors such as the TMS-320 microprocessor
[216].


10.4 Evaluation of the PFA and WFTA 


As for the Cooley-Tukey FFT’s, the first evaluation of these algorithms will be on the number
of multiplications and additions required. The number of multiplications to compute the PFA
in (10.8) is given by Multidimensional Index Mapping: Equation 3 (3.3). Using the notation
that $T(N)$ is the number of multiplications or additions necessary to calculate a length-$N$
DFT, the total number for a four-factor PFA of length-$N$, where $N = N_1 N_2 N_3 N_4$, is

$$ T(N) = N_1 N_2 N_3 T(N_4) + N_2 N_3 N_4 T(N_1) + N_3 N_4 N_1 T(N_2) + N_4 N_1 N_2 T(N_3) $$    (10.18)


The counts of multiplies and adds in Table 10.1 are calculated from (10.18) with the counts of
the factors taken from Winograd’s Short DFT Algorithms: Table 1 (Table 7.1). The lengths
listed are those possible with modules in the program of length 2, 3, 4, 5, 7, 8, 9 and 16 as
is true for the PFA in [65], [57] and the WFTA in [236], [83]. A maximum of four relatively
prime lengths can be used from this group, giving 59 different lengths over the range from
2 to 5040. The radix-2 or split-radix FFT allows 12 different lengths over the same range.
If modules of length 11 and 13 from [188] are added, the maximum length becomes 720720
and the number of different lengths becomes 239. Adding modules for 17, 19 and 25 from
[188] gives a maximum length of 1163962800 and a very large and dense number of possible
lengths. The code for the longer modules becomes excessively long, so these modules should
not be included unless needed.
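The length counts quoted above can be reproduced by enumerating every set of pairwise relatively prime modules and multiplying them out. The following Python sketch does this by brute force; the function name `pfa_lengths` is hypothetical and `math.prod` requires Python 3.8 or later.

```python
from itertools import combinations
from math import gcd, prod

def pfa_lengths(modules):
    """All transform lengths that are products of pairwise coprime modules."""
    lengths = set()
    for r in range(1, len(modules) + 1):
        for combo in combinations(modules, r):
            if all(gcd(a, b) == 1 for a, b in combinations(combo, 2)):
                lengths.add(prod(combo))
    return sorted(lengths)

base = pfa_lengths([2, 3, 4, 5, 7, 8, 9, 16])
extended = pfa_lengths([2, 3, 4, 5, 7, 8, 9, 16, 11, 13])
largest = pfa_lengths([2, 3, 4, 5, 7, 8, 9, 16, 11, 13, 17, 19, 25])
radix2 = [2 ** m for m in range(1, 13) if 2 ** m <= 5040]

print(len(base), base[0], base[-1])       # number of lengths and their range
print(len(extended), extended[-1])
print(largest[-1])
print(len(radix2))
```

Under these assumptions the enumeration gives 59 lengths between 2 and 5040 for the basic module set, 239 lengths up to 720720 when 11 and 13 are added, and a maximum of 1163962800 with 17, 19 and 25 included, against 12 power-of-two lengths in the basic range.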


The number of multiplications necessary for the WFTA is simply the product of those 
necessary for the required modules, including multiplications by unity. The total number may 
contain some unity multipliers but it is difficult to remove them in a practical program. Table 
10.1 contains both the total number (MULTS) and the number with the unity multiplies 
removed (RMULTS). 


Calculating the number of additions for the WFTA is more complicated than for the PFA
because of the expansion of the data moving through the algorithm. For example the number
of additions, TA, for the length-15 example in Figure 10.2 is given by

$$ TA(N) = N_2 \, TA(N_1) + TM_1 \, TA(N_2) $$    (10.19)

where $N_1 = 3$, $N_2 = 5$, and $TM_1$ is the number of multiplies for the length-3 module and
hence the expansion factor. As mentioned earlier there is an optimum ordering to minimize
additions. The ordering used to calculate Table 10.1 is the ordering used in [236], [83] which
is optimal in most cases and close to optimal in the others.
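As a worked check of (10.18), (10.19) and the WFTA multiplication rule, the Python sketch below recomputes a few two-factor rows of Table 10.1. The per-module counts in `MODULES` are assumptions of this example, chosen to be consistent with the table rows for lengths 10, 14, 15, 21 and 35; the authoritative values are those of Table 7.1. The function names are hypothetical.

```python
from math import prod

# length: (real multiplies excluding unity, real adds, multipliers including unity)
# Assumed module counts for complex data; see Table 7.1 for the actual values.
MODULES = {2: (0, 4, 2), 3: (4, 12, 3), 5: (10, 34, 6), 7: (16, 72, 9)}

def pfa_mults(factors):
    """Real multiplies for the PFA, the general form of (10.18)."""
    N = prod(factors)
    return sum((N // f) * MODULES[f][0] for f in factors)

def pfa_adds(factors):
    """Real additions for the PFA, the general form of (10.18)."""
    N = prod(factors)
    return sum((N // f) * MODULES[f][1] for f in factors)

def wfta_mults(factors):
    """Real multiplies for the WFTA, counting unity multipliers (MULTS)."""
    return 2 * prod(MODULES[f][2] for f in factors)

def wfta_adds(N1, N2):
    """Real additions for a two-factor WFTA from (10.19), module N1 done first."""
    return N2 * MODULES[N1][1] + MODULES[N1][2] * MODULES[N2][1]

for N1, N2 in [(2, 5), (3, 5), (3, 7), (7, 5)]:
    print(N1 * N2, pfa_mults([N1, N2]), pfa_adds([N1, N2]),
          wfta_mults([N1, N2]), wfta_adds(N1, N2))
```

With these counts the length-15 row comes out as 50, 162, 36 and 162. The length-35 case shows the effect of ordering: `wfta_adds(7, 5)` gives the tabulated 666 additions, while `wfta_adds(5, 7)` gives 670.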









































Length N | PFA Mults | PFA Adds | WFTA Mults | WFTA RMults | WFTA Adds
10 | 20 | 88 | 24 | 20 | 88
12 | 16 | 96 | 24 | 16 | 96
14 | 32 | 172 | 36 | 32 | 172
15 | 50 | 162 | 36 | 34 | 162
18 | 40 | 204 | 44 | 40 | 208
20 | 40 | 216 | 48 | 40 | 216
21 | 76 | 300 | 54 | 52 | 300
24 | 44 | 252 | 48 | 36 | 252
28 | 64 | 400 | 72 | 64 | 400
30 | 100 | 384 | 72 | 68 | 384
35 | 150 | 598 | 108 | 106 | 666
36 | 80 | 480 | 88 | 80 | 488
40 | 100 | 532 | 96 | 84 | 532
42 | 152 | 684 | 108 | 104 | 684
45 | 190 | 726 | 132 | 130 | 804
48 | 124 | 636 | 108 | 92 | 660
56 | 156 | 940 | 144 | 132 | 940
60 | 200 | 888 | 144 | 136 | 888
63 | 284 | 1236 | 198 | 196 | 1394
70 | 300 | 1336 | 216 | 212 | 1472
72 | 196 | 1140 | 176 | 164 | 1156
80 | 260 | 1284 | 216 | 200 | 1352
84 | 304 | 1536 | 216 | 208 | 1536
90 | 380 | 1632 | 264 | 260 | 1788
105 | 590 | 2214 | 324 | 322 | 2418
112 | 396 | 2188 | 324 | 308 | 2332
120 | 460 | 2076 | 288 | 276 | 2076
126 | 568 | 2724 | 396 | 392 | 3040
140 | 600 | 2952 | 432 | 424 | 3224
144 | 500 | 2676 | 396 | 380 | 2880
168 | 692 | 3492 | 432 | 420 | 3492
180 | 760 | 3624 | 528 | 520 | 3936
210 | 1180 | 4848 | 648 | 644 | 5256
240 | 1100 | 4812 | 648 | 632 | 5136
252 | 1136 | 5952 | 792 | 784 | 6584
280 | 1340 | 6604 | 864 | 852 | 7148
315 | 2050 | 8322 | 1188 | 1186 | 10336
336 | 1636 | 7908 | 972 | 956 | 8508
360 | 1700 | 8148 | 1056 | 1044 | 8772
420 | 2360 | 10536 | 1296 | 1288 | 11352
504 | 2524 | 13164 | 1584 | 1572 | 14428
560 | 3100 | 14748 | 1944 | 1928 | 17168
630 | 4100 | 17904 | 2376 | 2372 | 21932
720 | 3940 | 18276 | 2376 | 2360 | 21132
840 | 5140 | 23172 | 2592 | 2580 | 24804
1008 | 5804 | 29100 | 3564 | 3548 | 34416
1260 | 8200 | 38328 | 4752 | 4744 | 46384
1680 | 11540 | 50964 | 5832 | 5816 | 59064
2520 | 17660 | 82956 | 9504 | 9492 | 99068
5040 | 39100 | 179772 | 21384 | 21368 | 232668

Table 10.1: Number of Real Multiplications and Additions for Complex PFA and WFTA FFTs


From Table 10.1 we see that compared to the PFA or any of the Cooley-Tukey FFT’s, the
WFTA has significantly fewer multiplications. For the shorter lengths, the WFTA and the
PFA have approximately the same number of additions; however, for longer lengths, the PFA
has fewer and the Cooley-Tukey FFT’s always have the fewest. If the total arithmetic, the
number of multiplications plus the number of additions, is compared, the split-radix FFT,
PFA and WFTA all have about the same count. Special versions of the PFA and WFTA
have been developed for real data [178], [358].

The size of the Cooley-Tukey program is the smallest, the PFA next and the WFTA largest.
The PFA requires the smallest number of stored constants, the Cooley-Tukey or split-radix
FFT next, and the WFTA requires the largest number. For a DFT of length approximately 1000,
the PFA stores 28 constants, the FFT 2048 and the WFTA 3564. Both the FFT and PFA can be
calculated in-place and the WFTA cannot. The PFA can be calculated in-order without an
unscrambler. The radix-2 FFT can also, but it requires additional indexing overhead [194].
The indexing and data transfer overhead is greatest for the WFTA because the separate
preweave and postweave sections each require their own indexing and pass through the complete
data. The shorter modules in the PFA and WFTA and the butterflies in the radix 2 and
4 FFT’s are more efficient than the longer ones because intermediate calculations can be
kept in CPU registers rather than general memory [250]. However, the shorter modules and radices
require more passes through the data for a given approximate length. A proper comparison
will require actual programs to be compiled and run on a particular machine. There are
many open questions about the relationship of algorithms and hardware architecture.




Chapter 11 


Implementing FFTs in Practice¹


by Steven G. Johnson (Department of Mathematics, Massachusetts Institute of Technology) 
and Matteo Frigo (Cilk Arts, Inc.) 


11.1 Introduction 


Although there are a wide range of fast Fourier transform (FFT) algorithms, involving a 
wealth of mathematics from number theory to polynomial algebras, the vast majority of 
FFT implementations in practice employ some variation on the Cooley-Tukey algorithm 
[92]. The Cooley-Tukey algorithm can be derived in two or three lines of elementary algebra. 
It can be implemented almost as easily, especially if only power-of-two sizes are desired; 
numerous popular textbooks list short FFT subroutines for power-of-two sizes, written in 
the language du jour. The implementation of the Cooley-Tukey algorithm, at least, would 
therefore seem to be a long-solved problem. In this chapter, however, we will argue that 
matters are not as straightforward as they might appear. 


For many years, the primary route to improving upon the Cooley-Tukey FFT seemed to be 
reductions in the count of arithmetic operations, which often dominated the execution time 
prior to the ubiquity of fast floating-point hardware (at least on non-embedded processors). 
Therefore, great effort was expended towards finding new algorithms with reduced arithmetic
counts [114], from Winograd’s method to achieve $\Theta(n)$ multiplications² (at the cost
of many more additions) [411], [180], [116], [114] to the split-radix variant on Cooley-Tukey
that long achieved the lowest known total count of additions and multiplications for
power-of-two sizes [422], [107], [391], [230], [114] (but was recently improved upon [202], [225]).
The question of the minimum possible arithmetic count continues to be of fundamental
theoretical interest: it is not even known whether better than $\Theta(n \log n)$ complexity is possible,
since $\Omega(n \log n)$ lower bounds on the count of additions have only been proven subject to
restrictive assumptions about the algorithms [248], [280], [281]. Nevertheless, the difference
in the number of arithmetic operations, for power-of-two sizes n, between the 1965 radix-2
Cooley-Tukey algorithm ($\sim 5 n \log_2 n$ [92]) and the currently lowest-known arithmetic count
($\sim \frac{34}{9} n \log_2 n$ [202], [225]) remains only about 25%.

¹This content is available online at <http://cnx.org/content/m16336/1.15/>.
²We employ the standard asymptotic notation of $O$ for asymptotic upper bounds, $\Theta$ for asymptotic tight
bounds, and $\Omega$ for asymptotic lower bounds [210].

Figure 11.1: The ratio of speed (1/time) between a highly optimized FFT (FFTW
3.1.2 [133], [134]) and a typical textbook radix-2 implementation (Numerical Recipes in
C [290]) on a 3 GHz Intel Core Duo with the Intel C compiler 9.1.043, for single-precision
complex-data DFTs of size n, plotted versus $\log_2 n$. Top line (squares) shows FFTW
with SSE SIMD instructions enabled, which perform multiple arithmetic operations at
once (see section ); bottom line (circles) shows FFTW with SSE disabled, which thus
requires a similar number of arithmetic instructions to the textbook code. (This is not
intended as a criticism of Numerical Recipes, since simple radix-2 implementations are
reasonable for pedagogy, but it illustrates the radical differences between straightforward
and optimized implementations of FFT algorithms, even with similar arithmetic costs.)
For $n \ge 2^{19}$, the ratio increases because the textbook code becomes much slower (this
happens when the DFT size exceeds the level-2 cache).





And yet there is a vast gap between this basic mathematical theory and the actual practice— 
highly optimized FFT packages are often an order of magnitude faster than the textbook 
subroutines, and the internal structure to achieve this performance is radically different 
from the typical textbook presentation of the “same” Cooley-Tukey algorithm. For example, 
Figure 11.1 plots the ratio of benchmark speeds between a highly optimized FFT [133], [134] 
and a typical textbook radix-2 implementation [290], and the former is faster by a factor 
of 5-40 (with a larger ratio as n grows). Here, we will consider some of the reasons for 
this discrepancy, and some techniques that can be used to address the difficulties faced by a
practical high-performance FFT implementation.³


In particular, in this chapter we will discuss some of the lessons learned and the strategies 
adopted in the FFTW library. FFTW [133], [134] is a widely used free-software library that 
computes the discrete Fourier transform (DFT) and its various special cases. Its performance 
is competitive even with manufacturer-optimized programs [134], and this performance is
portable thanks to the structure of the algorithms employed, self-optimization techniques,
and highly optimized kernels (FFTW’s codelets) generated by a special-purpose compiler.


This chapter is structured as follows. First, in "Review of the Cooley-Tukey FFT" (Section 11.2),
we briefly review the basic ideas behind the Cooley-Tukey algorithm and define some
common terminology, especially focusing on the many
degrees of freedom that the abstract algorithm allows to implementations. Next, in "Goals
and Background of the FFTW Project" (Section 11.3),
we provide some context for FFTW’s development and stress that performance,
while it receives the most publicity, is not necessarily the most important consideration in
the implementation of a library of this sort. Third, in "FFTs and the Memory Hierarchy"
(Section 11.4), we consider a basic theoretical model
of the computer memory hierarchy and its impact on FFT algorithm choices: quite general
considerations push implementations towards large radices and explicitly recursive structure.
Unfortunately, general considerations are not sufficient in themselves, so we will explain in
"Adaptive Composition of FFT Algorithms" (Section 11.5)
how FFTW self-optimizes for particular machines by selecting its algorithm at
runtime from a composition of simple algorithmic steps. Furthermore, "Generating Small
FFT Kernels" (Section 11.6) describes the utility and the
principles of automatic code generation used to produce the highly optimized building blocks
of this composition, FFTW’s codelets. Finally, we will briefly consider an important
non-performance issue, in "Numerical Accuracy in FFTs" (Section 11.7).


11.2 Review of the Cooley-Tukey FFT 


The (forward, one-dimensional) discrete Fourier transform (DFT) of an array X of n complex
numbers is the array Y given by

$$ Y[k] = \sum_{\ell=0}^{n-1} X[\ell] \, \omega_n^{\ell k}, $$    (11.1)

where $0 \le k < n$ and $\omega_n = \exp(-2\pi i/n)$ is a primitive root of unity. Implemented directly,
(11.1) would require $O(n^2)$ operations; fast Fourier transforms are $O(n \log n)$ algorithms
to compute the same result. The most important FFT (and the one primarily used in
FFTW) is known as the “Cooley-Tukey” algorithm, after the two authors who rediscovered
and popularized it in 1965 [92], although it had been previously known as early as 1805 by
Gauss as well as by later re-inventors [173]. The basic idea behind this FFT is that a DFT
of a composite size $n = n_1 n_2$ can be re-expressed in terms of smaller DFTs of sizes $n_1$ and
$n_2$: essentially, as a two-dimensional DFT of size $n_1 \times n_2$ where the output is transposed.
The choices of factorizations of n, combined with the many different ways to implement the
data re-orderings of the transpositions, have led to numerous implementation strategies for
the Cooley-Tukey FFT, with many variants distinguished by their own names [114], [389].
FFTW implements a space of many such variants, as described in "Adaptive Composition
of FFT Algorithms" (Section 11.5), but here we
derive the basic algorithm, identify its key features, and outline some important historical
variations and their relation to FFTW.

³We won’t address the question of parallelization on multi-processor machines, which adds even greater
difficulty to FFT implementation. Although multi-processors are increasingly important, achieving good
serial performance is a basic prerequisite for optimized parallel code, and is already hard enough!


The Cooley-Tukey algorithm can be derived as follows. If n can be factored into $n = n_1 n_2$,
(11.1) can be rewritten by letting $\ell = \ell_1 n_2 + \ell_2$ and $k = k_1 + k_2 n_1$. We then have:

$$ Y[k_1 + k_2 n_1] = \sum_{\ell_2=0}^{n_2-1} \left[ \left( \sum_{\ell_1=0}^{n_1-1} X[\ell_1 n_2 + \ell_2] \, \omega_{n_1}^{\ell_1 k_1} \right) \omega_n^{\ell_2 k_1} \right] \omega_{n_2}^{\ell_2 k_2}, $$    (11.2)

where $k_{1,2} = 0, \ldots, n_{1,2} - 1$. Thus, the algorithm computes $n_2$ DFTs of size $n_1$ (the inner
sum), multiplies the result by the so-called [139] twiddle factors $\omega_n^{\ell_2 k_1}$, and finally computes
$n_1$ DFTs of size $n_2$ (the outer sum). This decomposition is then continued recursively. The
literature uses the term radix to describe an $n_1$ or $n_2$ that is bounded (often constant); the
small DFT of the radix is traditionally called a butterfly.
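The decomposition (11.2), applied recursively, can be sketched in a few lines of Python. This is an illustrative out-of-place version only and makes no attempt at efficiency: it picks the smallest prime factor of n as the radix $n_2$, recomputes its twiddle factors on the fly, and falls back to direct summation for prime sizes. The function names are hypothetical.

```python
import cmath

def dft_direct(X):
    """Direct evaluation of (11.1), used for prime sizes and as a reference."""
    n = len(X)
    return [sum(X[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n))
            for k in range(n)]

def smallest_factor(n):
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n

def cooley_tukey(X):
    """Recursive mixed-radix FFT following (11.2) with n2 as the radix."""
    n = len(X)
    n2 = smallest_factor(n)
    if n2 == n:                        # n is 1 or prime: nothing to factor
        return dft_direct(X)
    n1 = n // n2
    # n2 inner DFTs of size n1 on the decimated inputs X[l1*n2 + l2]
    inner = [cooley_tukey(X[l2::n2]) for l2 in range(n2)]
    Y = [0j] * n
    for k1 in range(n1):
        # multiply by the twiddle factors omega_n^(l2*k1)
        t = [inner[l2][k1] * cmath.exp(-2j * cmath.pi * l2 * k1 / n)
             for l2 in range(n2)]
        # outer DFT of size n2 (the butterfly), output at index k1 + k2*n1
        for k2 in range(n2):
            Y[k1 + k2 * n1] = sum(t[l2] * cmath.exp(-2j * cmath.pi * l2 * k2 / n2)
                                  for l2 in range(n2))
    return Y
```

For a power-of-two length this reduces to a radix-2 algorithm in which the size-2 butterflies are applied after the two half-size transforms have been computed.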


Many well-known variations are distinguished by the radix alone. A decimation in time
(DIT) algorithm uses $n_2$ as the radix, while a decimation in frequency (DIF) algorithm
uses $n_1$ as the radix. If multiple radices are used, e.g. for n composite but not a prime power,
the algorithm is called mixed radix. A peculiar blending of radix 2 and 4 is called split
radix, which was proposed to minimize the count of arithmetic operations [422], [107], [391],
[230], [114] although it has been superseded in this regard [202], [225]. FFTW implements
both DIT and DIF, is mixed-radix with radices that are adapted to the hardware, and
often uses much larger radices (e.g. radix 32) than were once common. On the other end of
the scale, a “radix” of roughly $\sqrt{n}$ has been called a four-step FFT algorithm (or six-step,
depending on how many transposes one performs) [14]; see "FFTs and the Memory
Hierarchy" (Section 11.4) for some theoretical and practical
discussion of this algorithm.


A key difficulty in implementing the Cooley-Tukey FFT is that the $n_1$ dimension corresponds
to discontiguous inputs $\ell_1$ in X but contiguous outputs $k_1$ in Y, and vice-versa for $n_2$. This
is a matrix transpose for a single decomposition stage, and the composition of all such
transpositions is a (mixed-base) digit-reversal permutation (or bit-reversal, for radix 2).
The resulting necessity of discontiguous memory access and data re-ordering hinders efficient
use of hierarchical memory architectures (e.g., caches), so that the optimal execution order
of an FFT for given hardware is non-obvious, and various approaches have been proposed.
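For radix 2, the statement that the composed transpositions amount to a bit-reversal permutation can be checked with a short Python sketch. Here `decimation_order` repeats the even/odd split of a radix-2 decomposition at every level, and `bit_reversed` reverses the binary digits of each index directly; both names are hypothetical and the length is assumed to be a power of two.

```python
def decimation_order(indices):
    """Order in which inputs are visited after repeated even/odd splitting."""
    if len(indices) <= 1:
        return list(indices)
    return decimation_order(indices[0::2]) + decimation_order(indices[1::2])

def bit_reversed(n):
    """Bit-reversal permutation of 0..n-1 for n a power of two."""
    bits = n.bit_length() - 1
    if bits == 0:
        return [0]
    return [int(format(i, "b").zfill(bits)[::-1], 2) for i in range(n)]

print(decimation_order(list(range(8))))   # [0, 4, 2, 6, 1, 5, 3, 7]
print(bit_reversed(8))                    # [0, 4, 2, 6, 1, 5, 3, 7]
```

The two lists agree, and the large jumps between consecutive entries illustrate the discontiguous memory access discussed above.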










Figure 11.2: Schematic of traditional breadth-first (left) vs. recursive depth-first (right) 
ordering for radix-2 FFT of size 8: the computations for each nested box are completed 
before doing anything else in the surrounding box. Breadth-first computation performs 
all butterflies of a given size at once, while depth-first computation completes one sub- 
transform entirely before moving on to the next (as in the algorithm below). 





One ordering distinction is between recursion and iteration. As expressed above, the Cooley- 
Tukey algorithm could be thought of as defining a tree of smaller and smaller DFTs, as 
depicted in Figure 11.2; for example, a textbook radix-2 algorithm would divide size n into 
two transforms of size n/2, which are divided into four transforms of size n/4, and so on 
until a base case is reached (in principle, size 1). This might naturally suggest a recursive 
implementation in which the tree is traversed “depth-first” as in Figure 11.2(right) and the
algorithm of Listing 11.1: one size n/2 transform is solved completely before processing the other
one, and so on. However, most traditional FFT implementations are non-recursive (with rare
exceptions [341]) and traverse the tree “breadth-first” [389] as in Figure 11.2(left): in the
radix-2 example, they would perform n (trivial) size-1 transforms, then n/2 combinations
into size-2 transforms, then n/4 combinations into size-4 transforms, and so on, thus making
$\log_2 n$ passes over the whole array. In contrast, as we discuss in "Discussion" (Section 11.5.2.6:
Discussion), FFTW employs an explicitly recursive strategy that encompasses both depth- 
first and breadth-first styles, favoring the former since it has some theoretical and practical 
advantages as discussed in "FFTs and the Memory Hierarchy" (Section 11.4: FFTs and the 
Memory Hierarchy). 
Y[0, ..., n−1] ← recfft2(n, X, ι):
    IF n = 1 THEN
        Y[0] ← X[0]
    ELSE
        Y[0, ..., n/2−1] ← recfft2(n/2, X, 2ι)
        Y[n/2, ..., n−1] ← recfft2(n/2, X + ι, 2ι)
        FOR k_1 = 0 TO (n/2)−1 DO
            t ← Y[k_1]
            Y[k_1] ← t + ω_n^{k_1} Y[k_1 + n/2]
            Y[k_1 + n/2] ← t − ω_n^{k_1} Y[k_1 + n/2]
        END FOR
    END IF


Listing 11.1: A depth-first recursive radix-2 DIT Cooley-Tukey FFT to compute a DFT
of a power-of-two size $n = 2^m$. The input is an array X of length n with stride ι (i.e., the
inputs are X[ℓι] for ℓ = 0, ..., n−1) and the output is an array Y of length n (with stride
1), containing the DFT of X [Equation 1]. X + ι denotes the array beginning with X[ι].
This algorithm operates out-of-place, produces in-order output, and does not require a 
separate bit-reversal stage. 
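
For readers who want to run the listing, here is a minimal Python transcription. It is a sketch under stated assumptions, not FFTW code: the pointer arithmetic X + ι is modelled by an explicit `offset` argument (a name introduced here), and the sign convention $\omega_n = e^{-2\pi i/n}$ is assumed for the root of unity.

```python
import cmath


def recfft2(n, x, offset=0, stride=1):
    """Depth-first radix-2 DIT FFT of the n inputs x[offset + l*stride].

    n must be a power of two. Returns a new list y of length n (stride 1)
    in natural order; no bit-reversal pass is needed.
    """
    if n == 1:
        return [complex(x[offset])]
    half = n // 2
    # Even-indexed inputs: same starting point, doubled stride.
    y = recfft2(half, x, offset, 2 * stride)
    # Odd-indexed inputs: start one stride later ("X + iota"), doubled stride.
    y += recfft2(half, x, offset + stride, 2 * stride)
    # Combine the two finished sub-transforms with n/2 butterflies.
    for k in range(half):
        w = cmath.exp(-2j * cmath.pi * k / n)   # omega_n^k (assumed sign)
        t = y[k]
        u = w * y[k + half]
        y[k] = t + u
        y[k + half] = t - u
    return y
```

Calling `recfft2(len(x), x)` transforms a whole list; the first recursive call returns before the second starts, which is exactly the depth-first order of Figure 11.2(right).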





A second ordering distinction lies in how the digit-reversal is performed. The classic approach 
is a single, separate digit-reversal pass following or preceding the arithmetic computations; 
this approach is so common and so deeply embedded into FFT lore that many practitioners 
find it difficult to imagine an FFT without an explicit bit-reversal stage. Although this 
pass requires only O(n) time [207], it can still be non-negligible, especially if the data is 
out-of-cache; moreover, it neglects the possibility that data reordering during the transform 
may improve memory locality. Perhaps the oldest alternative is the Stockham auto-sort 
FFT [367], [389], which transforms back and forth between two arrays with each butterfly, 
transposing one digit each time, and was popular to improve contiguity of access for vector 
computers [372]. Alternatively, an explicitly recursive style, as in FFTW, performs the digit- 
reversal implicitly at the “leaves” of its computation when operating out-of-place (see section 
"Discussion" (Section 11.5.2.6: Discussion)). A simple example of this style, which computes
in-order output using an out-of-place radix-2 FFT without explicit bit-reversal, is shown in
the algorithm of Listing 11.1 [corresponding to Figure 11.2(right)]. To operate in-place with O(1)
scratch storage, one can interleave small matrix transpositions with the butterflies [195],
[375], [297], [166], and a related strategy in FFTW [134] is briefly described by "Discussion"
(Section 11.5.2.6: Discussion).
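
The Stockham auto-sort idea mentioned above can be sketched in a few lines of Python. This is one common radix-2 formulation, written for illustration only; the variable names and the sign convention $\omega_n = e^{-2\pi i/n}$ are assumptions, and it is not claimed to match any particular published implementation.

```python
import cmath


def stockham_fft(x):
    """Radix-2 Stockham auto-sort FFT; len(x) must be a power of two.

    Each pass reads from one array and writes to the other, moving one
    index digit into place, so the result is in natural order without a
    separate bit-reversal pass.
    """
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = [complex(v) for v in x]   # array read in the current pass
    b = [0j] * n                  # array written in the current pass
    half = n // 2                 # half the size of the transforms still to do
    stride = 1                    # number of interleaved sub-sequences
    while half >= 1:
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / (2 * half))
            for q in range(stride):
                u = a[q + stride * p]
                v = a[q + stride * (p + half)]
                b[q + stride * (2 * p)] = u + v
                b[q + stride * (2 * p + 1)] = (u - v) * w
        a, b = b, a               # ping-pong between the two arrays
        half //= 2
        stride *= 2
    return a
```

The inner loop over `q` touches consecutive elements, which is the contiguity property that made this arrangement attractive on vector machines.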


Finally, we should mention that there are many FFTs entirely distinct from Cooley-Tukey. 
Three notable such algorithms are the prime-factor algorithm for $\gcd(n_1, n_2) = 1$ [278],
along with Rader’s [309] and Bluestein’s [35], [305], [278] algorithms for prime n. FFTW 
implements the first two in its codelet generator for hard-coded n "Generating Small FFT 
Kernels" (Section 11.6: Generating Small FFT Kernels) and the latter two for general prime 
n (sections "Plans for prime sizes" (Section 11.5.2.5: Plans for prime sizes) and "Goals and 
Background of the FFTW Project" (Section 11.3: Goals and Background of the FFTW 
Project)). There is also the Winograd FFT [411], [180], [116], [114], which minimizes the 
number of multiplications at the expense of a large number of additions; this trade-off is not 
beneficial on current processors that have specialized hardware multipliers. 
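
As an illustration of how the prime-factor algorithm differs from Cooley-Tukey, the sketch below maps a length-$n_1 n_2$ DFT with coprime factors onto a two-dimensional DFT with no twiddle factors in between. It is a teaching example, not FFTW's implementation: the small transforms are done by a naive $O(n^2)$ DFT, the function names are hypothetical, and $\omega_n = e^{-2\pi i/n}$ is assumed.

```python
import cmath
from math import gcd


def naive_dft(x):
    """O(n^2) reference DFT with omega_n = exp(-2*pi*i/n)."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n))
            for k in range(n)]


def prime_factor_dft(x, n1, n2):
    """DFT of length n1*n2 via prime-factor index maps; needs gcd(n1, n2) == 1."""
    n = n1 * n2
    if len(x) != n or gcd(n1, n2) != 1:
        raise ValueError("need len(x) == n1*n2 with coprime n1, n2")
    # Input map: l = (n2*l1 + n1*l2) mod n arranges x as an n1-by-n2 array.
    a = [[x[(n2 * l1 + n1 * l2) % n] for l2 in range(n2)] for l1 in range(n1)]
    # Length-n2 DFTs along each row, then length-n1 DFTs along each column.
    rows = [naive_dft(row) for row in a]
    cols = [naive_dft([rows[l1][k2] for l1 in range(n1)]) for k2 in range(n2)]
    # Output map (Chinese remainder theorem): k1 = k mod n1, k2 = k mod n2.
    return [cols[k % n2][k % n1] for k in range(n)]
```

Note that no multiplication by $\omega_n$ twiddle factors occurs between the row and column transforms; the price is the more complicated index maps.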


11.3 Goals and Background of the FFTW Project


The FFTW project, begun in 1997 as a side project of the authors Frigo and Johnson as 
graduate students at MIT, has gone through several major revisions, and as of 2008 consists 
of more than 40,000 lines of code. It is difficult to measure the popularity of a free-software 
package, but (as of 2008) FFTW has been cited in over 500 academic papers, is used in 
hundreds of shipping free and proprietary software packages, and the authors have received 
over 10,000 emails from users of the software. Most of this chapter focuses on performance 
of FFT implementations, but FFTW would probably not be where it is today if that were 
the only consideration in its design. One of the key factors in FFTW’s success seems to 
have been its flexibility in addition to its performance. In fact, FFTW is probably the most
flexible DFT library available: 


• FFTW is written in portable C and runs well on many architectures and operating
systems.

• FFTW computes DFTs in $O(n \log n)$ time for any length n. (Most other DFT implementations
are either restricted to a subset of sizes or they become $\Theta(n^2)$ for certain
values of n, for example when n is prime.)

• FFTW imposes no restrictions on the rank (dimensionality) of multi-dimensional transforms.
(Most other implementations are limited to one-dimensional, or at most two-
and three-dimensional data.)

• FFTW supports multiple and/or strided DFTs; for example, to transform a 3-component
vector field or a portion of a multi-dimensional array. (Most implementations
support only a single DFT of contiguous data.)

• FFTW supports DFTs of real data, as well as of real symmetric/anti-symmetric data
(also called discrete cosine/sine transforms).


Our design philosophy has been to first define the most general reasonable functionality, and 
then to obtain the highest possible performance without sacrificing this generality. In this 
section, we offer a few thoughts about why such flexibility has proved important, and how 
it came about that FFTW was designed in this way. 
FFTW’s generality is partly a consequence of the fact that the FFTW project was started in
response to the needs of a real application for one of the authors (a spectral solver for 
Maxwell’s equations [204]), which from the beginning had to run on heterogeneous hardware. 
Our initial application required multi-dimensional DFTs of three-component vector fields 
(magnetic fields in electromagnetism), and so right away this meant: (i) multi-dimensional 
FFTs; (ii) user-accessible loops of FFTs of discontiguous data; (iii) efficient support for
non-power-of-two sizes (the factor of eight difference between n × n × n and 2n × 2n × 2n was
too much to tolerate); and (iv) saving a factor of two for the common real-input case was
desirable. That is, the initial requirements already encompassed most of the features above, 
and nothing about this application is particularly unusual. 


Even for one-dimensional DFTs, there is a common misperception that one should always 
choose power-of-two sizes if one cares about efficiency. Thanks to FFTW’s code generator 
(described in "Generating Small FFT Kernels" (Section 11.6: Generating Small FFT Ker- 
nels)), we could afford to devote equal optimization effort to any n with small factors (2, 3, 
5, and 7 are good), instead of mostly optimizing powers of two like many high-performance 
FFTs. As a result, to pick a typical example on the 3 GHz Core Duo processor of Figure 11.1, 
$n = 3600 = 2^4 \cdot 3^2 \cdot 5^2$ and $n = 3840 = 2^8 \cdot 3 \cdot 5$ both execute faster than $n = 4096 = 2^{12}$.
(And if there are factors one particularly cares about, one can generate code for them too.) 
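
In practice, choosing a size with small factors is easy to automate. The helpers below are a hypothetical sketch (the names `factorize` and `next_smooth_size` are not part of FFTW): they factor a candidate length and round a length up to the next one whose prime factors all lie in {2, 3, 5, 7}.

```python
def factorize(n):
    """Prime factorization of n as {prime: exponent}, by trial division."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def next_smooth_size(n, primes=(2, 3, 5, 7)):
    """Smallest m >= n whose prime factors all belong to `primes`."""
    m = max(n, 1)
    while True:
        r = m
        for p in primes:
            while r % p == 0:
                r //= p
        if r == 1:          # nothing left over: m has only small factors
            return m
        m += 1
```

For example, `factorize(3600)` returns `{2: 4, 3: 2, 5: 2}`, matching the factorization quoted above. Whether a padded size is actually faster than the original one depends on the machine and should be measured.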


One initially missing feature was efficient support for large prime sizes; the conventional wis- 
dom was that large-prime algorithms were mainly of academic interest, since in real applica- 
tions (including ours) one has enough freedom to choose a highly composite transform size. 
However, the prime-size algorithms are fascinating, so we implemented Rader’s $O(n \log n)$
prime-n algorithm [309] purely for fun, including it in FFTW 2.0 (released in 1998) as a 
bonus feature. The response was astonishingly positive—even though users are (probably) 
never forced by their application to compute a prime-size DFT, it is rather inconvenient to 
always worry that collecting an unlucky number of data points will slow down one’s analysis 
by a factor of a million. The prime-size algorithms are certainly slower than algorithms for 
nearby composite sizes, but in interactive data-analysis situations the difference between 1 
ms and 10 ms means little, while educating users to avoid large prime factors is hard. 


Another form of flexibility that deserves comment has to do with a purely technical aspect of 
computer software. FFTW’s implementation involves some unusual language choices inter- 
nally (the FFT-kernel generator, described in "Generating Small FFT Kernels" (Section 11.6: 
Generating Small FFT Kernels), is written in Objective Caml, a functional language espe- 
cially suited for compiler-like programs), but its user-callable interface is purely in C with 
lowest-common-denominator datatypes (arrays of floating-point values). The advantage of 
this is that FFTW can be (and has been) called from almost any other programming lan- 
guage, from Java to Perl to Fortran 77. Similar lowest-common-denominator interfaces are 
apparent in many other popular numerical libraries, such as LAPACK [10]. Language prefer- 
ences arouse strong feelings, but this technical constraint means that modern programming 
dialects are best hidden from view for a numerical library. 
Ultimately, very few scientific-computing applications should have performance as their top
priority. Flexibility is often far more important, because one wants to be limited only by one’s 
imagination, rather than by one’s software, in the kinds of problems that can be studied. 


11.4 FFTs and the Memory Hierarchy 


There are many complexities of computer architectures that impact the optimization of FFT 
implementations, but one of the most pervasive is the memory hierarchy. On any modern 
general-purpose computer, memory is arranged into a hierarchy of storage devices with in- 
creasing size and decreasing speed: the fastest and smallest memory being the CPU registers, 
then two or three levels of cache, then the main-memory RAM, then external storage such 
as hard disks.⁴ Most of these levels are managed automatically by the hardware to hold the
most-recently-used data from the next level in the hierarchy.⁵ There are many complications,
however, such as limited cache associativity (which means that certain locations in memory 
cannot be cached simultaneously) and cache lines (which optimize the cache for contiguous 
memory access), which are reviewed in numerous textbooks on computer architectures. In 
this section, we focus on the simplest abstract principles of memory hierarchies in order to 
grasp their fundamental impact on FFTs. 


Because access to memory is in many cases the slowest part of the computer, especially 
compared to arithmetic, one wishes to load as much data as possible into the faster levels
of the hierarchy, and then perform as much computation as possible before going back to 
the slower memory devices. This is called temporal locality: if a given datum is used 
more than once, we arrange the computation so that these usages occur as close together as 
possible in time. 


11.4.1 Understanding FFTs with an ideal cache 


To understand temporal-locality strategies at a basic level, in this section we will employ an 
idealized model of a cache in a two-level memory hierarchy, as defined in [137]. This ideal 
cache stores Z data items from main memory (e.g. complex numbers for our purposes): 
when the processor loads a datum from memory, the access is quick if the datum is already 
in the cache (a cache hit) and slow otherwise (a cache miss, which requires the datum to 
be fetched into the cache). When a datum is loaded into the cache,⁶ it must replace some





⁴ A hard disk is utilized by “out-of-core” FFT algorithms for very large n [389], but these algorithms
appear to have been largely superseded in practice by both the gigabytes of memory now common on
personal computers and, for extremely large n, by algorithms for distributed-memory parallel computers.

⁵ This includes the registers: on current “x86” processors, the user-visible instruction set (with a small
number of floating-point registers) is internally translated at runtime to RISC-like “μ-ops” with a much larger
number of physical rename registers that are allocated automatically.

⁶ More generally, one can assume that a cache line of L consecutive data items is loaded into the cache
at once, in order to exploit spatial locality. The ideal-cache model in this case requires that the cache be
tall: $Z = \Omega(L^2)$ [137].
other datum, and the ideal-cache model assumes that the optimal replacement strategy is
used [20]: the new datum replaces the datum that will not be needed for the longest time in 
the future; in practice, this can be simulated to within a factor of two by replacing the least- 
recently used datum [137], but ideal replacement is much simpler to analyze. Armed with 
this ideal-cache model, we can now understand some basic features of FFT implementations 
that remain essentially true even on real cache architectures. In particular, we want to know 
the cache complexity, the number Q (n; Z) of cache misses for an FFT of size n with an 
ideal cache of size Z, and what algorithm choices reduce this complexity. 


First, consider a textbook radix-2 algorithm, which divides n by 2 at each stage and operates 
breadth-first as in Figure 11.2(left), performing all butterflies of a given size at a time. If 
n > Z, then each pass over the array incurs $\Theta(n)$ cache misses to reload the data, and
there are $\log_2 n$ passes, for $\Theta(n \log_2 n)$ cache misses in total: no temporal locality at all is
exploited!


One traditional solution to this problem is blocking: the computation is divided into maxi- 
mal blocks that fit into the cache, and the computations for each block are completed before 
moving on to the next block. Here, a block of Z numbers can fit into the cache⁷ (not including
storage for twiddle factors and so on), and thus the natural unit of computation is
a sub-FFT of size Z. Since each of these blocks involves $\Theta(Z \log Z)$ arithmetic operations,
and there are $\Theta(n \log n)$ operations overall, there must be $\Theta\left(\frac{n}{Z} \log_Z n\right)$ such blocks. More
explicitly, one could use a radix-Z Cooley-Tukey algorithm, breaking n down by factors of
Z [or $\Theta(Z)$] until a size Z is reached: each stage requires n/Z blocks, and there are $\log_Z n$
stages, again giving $\Theta\left(\frac{n}{Z} \log_Z n\right)$ blocks overall. Since each block requires Z cache misses to
load it into cache, the cache complexity $Q_b$ of such a blocked algorithm is


\[ Q_b(n; Z) = \Theta(n \log_Z n). \tag{11.3} \]


In fact, this complexity is rigorously optimal for Cooley-Tukey FFT algorithms [184], and 
immediately points us towards large radices (not radix 2!) to exploit caches effectively in 
FFTs. 
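
To get a feeling for the size of this effect, consider a worked example with illustrative numbers chosen here (they are not measurements), and with all hidden constant factors set to 1 as a simplifying assumption:

```latex
% Illustrative sizes: n = 2^{20} data items, ideal cache of Z = 2^{10} items.
\begin{align*}
\log_Z n &= \frac{\log_2 n}{\log_2 Z} = \frac{20}{10} = 2, \\
\text{blocks} &\approx \frac{n}{Z}\,\log_Z n = 2^{10} \cdot 2 = 2048, \\
Q_b(n;Z) &\approx Z \cdot 2048 = 2^{21} \approx 2.1 \times 10^{6}
  \quad\text{(blocked, radix } Z\text{)}, \\
n \log_2 n &= 20 \cdot 2^{20} \approx 2.1 \times 10^{7}
  \quad\text{(breadth-first radix 2)}.
\end{align*}
```

Under these assumptions the blocked algorithm incurs fewer misses by a factor of $\log_2 Z = 10$, which is the general ratio between $n \log_2 n$ and $n \log_Z n$.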


However, there is one shortcoming of any blocked FFT algorithm: it is cache aware, mean- 
ing that the implementation depends explicitly on the cache size Z. The implementation 
must be modified (e.g. changing the radix) to adapt to different machines as the cache size 
changes. Worse, as mentioned above, actual machines have multiple levels of cache, and 
to exploit these one must perform multiple levels of blocking, each parameterized by the 
corresponding cache size. In the above example, if there were a smaller and faster cache 
of size z < Z, the size-Z sub-FFTs should themselves be performed via radix-z Cooley- 
Tukey using blocks of size z. And so on. There are two paths out of these difficulties: one 
is self-optimization, where the implementation automatically adapts itself to the hardware 





⁷ Of course, O(n) additional storage may be required for twiddle factors, the output data (if the FFT is
not in-place), and so on, but these only affect the n that fits into cache by a constant factor and hence do 
not impact cache-complexity analysis. We won’t worry about such constant factors in this section. 
(implicitly including any cache sizes), as described in "Adaptive Composition of FFT Algorithms"
(Section 11.5: Adaptive Composition of FFT Algorithms); the other is to exploit
cache-oblivious algorithms. FFTW employs both of these techniques.


The goal of cache-obliviousness is to structure the algorithm so that it exploits the cache 
without having the cache size as a parameter: the same code achieves the same asymptotic 
cache complexity regardless of the cache size Z. An optimal cache-oblivious algorithm 
achieves the optimal cache complexity (that is, in an asymptotic sense, ignoring constant 
factors). Remarkably, optimal cache-oblivious algorithms exist for many problems, such 
as matrix multiplication, sorting, transposition, and FFTs [137]. Not all cache-oblivious 
algorithms are optimal, of course—for example, the textbook radix-2 algorithm discussed 
above is “pessimal” cache-oblivious (its cache complexity is independent of Z because it 
always achieves the worst case!). 


For instance, Figure 11.2(right) and the algorithm of Listing 11.1 show a way to obliviously exploit
the cache with a radix-2 Cooley-Tukey algorithm, by ordering the computation depth-first
rather than breadth-first. That is, the DFT of size n is divided into two DFTs of size n/2, and
one DFT of size n/2 is completely finished before doing any computations for the second
DFT of size n/2. The two subtransforms are then combined using n/2 radix-2 butterflies,
which requires a pass over the array (and hence n cache misses if n > Z). This process is
repeated recursively until a base-case (e.g. size 2) is reached. The cache complexity $Q_2(n; Z)$
of this algorithm satisfies the recurrence


\[
Q_2(n; Z) =
\begin{cases}
n & n \le Z \\
2\,Q_2(n/2; Z) + \Theta(n) & \text{otherwise}
\end{cases}
\tag{11.4}
\]

The key property is this: once the recursion reaches a size n ≤ Z, the subtransform fits
into the cache and no further misses are incurred. The algorithm does not “know” this and 
continues subdividing the problem, of course, but all of those further subdivisions are in- 
cache because they are performed in the same depth-first branch of the tree. The solution 
of (11.4) is 


\[ Q_2(n; Z) = \Theta(n \log[n/Z]). \tag{11.5} \]
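
The solution can be checked by unrolling the recurrence. In the sketch below, the constant c is a stand-in introduced here for the constant hidden in the linear term of (11.4), and n/Z is assumed to be a power of two so that the recursion lands exactly on size Z:

```latex
\begin{align*}
Q_2(n;Z) &= 2\,Q_2(n/2;Z) + cn \\
         &= 4\,Q_2(n/4;Z) + 2cn \\
         &\;\;\vdots \\
         &= 2^{j}\,Q_2(n/2^{j};Z) + j\,cn .
\end{align*}
% The unrolling stops when n/2^j = Z, i.e. j = \log_2(n/Z), where Q_2(Z;Z) = Z:
\begin{align*}
Q_2(n;Z) &= \frac{n}{Z}\cdot Z + cn\log_2\frac{n}{Z}
          = n + cn\log_2\frac{n}{Z}
          = \Theta\!\left(n\log[n/Z]\right).
\end{align*}
```

Each of the $\log_2(n/Z)$ levels above the cache size contributes one linear pass, and everything below that size is paid for once by the n compulsory misses.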


This is worse than the theoretical optimum $Q_b(n; Z)$ from (11.3), but it is cache-oblivious
(Z never entered the algorithm) and exploits at least some temporal locality.⁸ On the other
hand, when it is combined with FFTW’s self-optimization and larger radices in "Adaptive
Composition of FFT Algorithms" (Section 11.5: Adaptive Composition of FFT Algorithms),
this algorithm actually performs very well until n becomes extremely large. By itself, however,
the algorithm of Listing 11.1 must be modified to attain adequate performance for reasons
that have nothing to do with the cache. These practical issues are discussed further in
"Cache-obliviousness in practice" (Section 11.4.2: Cache-obliviousness in practice).
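
The difference between the two orderings can be observed with a small simulation. The sketch below is an illustration under explicit assumptions: it replays only the sequence of array indices touched by in-place radix-2 butterflies on contiguous blocks (ignoring twiddle factors and the input permutation), and it uses least-recently-used replacement as a stand-in for the ideal replacement policy. All function names are hypothetical.

```python
from collections import OrderedDict


def count_misses(trace, cache_size):
    """Count misses of a fully associative LRU cache holding cache_size items."""
    cache = OrderedDict()
    misses = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)          # mark as most recently used
        else:
            misses += 1
            if len(cache) >= cache_size:
                cache.popitem(last=False)    # evict the least recently used
            cache[addr] = True
    return misses


def breadth_first_trace(n):
    """All butterflies of size 2, then all of size 4, and so on."""
    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                yield start + k
                yield start + k + half
        size *= 2


def depth_first_trace(n, start=0):
    """Finish each half-size sub-transform completely before combining."""
    if n == 1:
        return
    half = n // 2
    yield from depth_first_trace(half, start)
    yield from depth_first_trace(half, start + half)
    for k in range(half):
        yield start + k
        yield start + k + half


n, z = 1024, 64
breadth = count_misses(breadth_first_trace(n), z)
depth = count_misses(depth_first_trace(n), z)
print(f"n={n}, Z={z}: breadth-first {breadth} misses, depth-first {depth} misses")
```

Both traces contain the same butterflies; only their order differs. The depth-first order incurs fewer misses once n exceeds Z, and the two counts coincide when the whole array fits in the cache.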





⁸ This advantage of depth-first recursive implementation of the radix-2 FFT was pointed out many years
ago by Singleton (where the “cache” was core memory) [341].
There exists a different recursive FFT that is optimal cache-oblivious, however, and that
is the radix-$\sqrt{n}$ “four-step” Cooley-Tukey algorithm (again executed recursively, depth-first)
[137]. The cache complexity $Q_o$ of this algorithm satisfies the recurrence:


\[
Q_o(n; Z) =
\begin{cases}
n & n \le Z \\
2\sqrt{n}\,Q_o(\sqrt{n}; Z) + \Theta(n) & \text{otherwise}
\end{cases}
\tag{11.6}
\]


That is, at each stage one performs $\sqrt{n}$ DFTs of size $\sqrt{n}$ (recursively), then multiplies by the
$\Theta(n)$ twiddle factors (and does a matrix transposition to obtain in-order output), then finally
performs another $\sqrt{n}$ DFTs of size $\sqrt{n}$. The solution of (11.6) is $Q_o(n; Z) = \Theta(n \log_Z n)$,
the same as the optimal cache complexity (11.3)!
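
The two recurrences can also be compared numerically. The sketch below evaluates (11.4) and (11.6) with the linear term taken to be exactly n, which is an assumption made here to get concrete numbers (the text only fixes it up to a constant factor); the function names are hypothetical.

```python
from math import isqrt


def q_radix2(n, z):
    """Recurrence (11.4) for depth-first radix 2, linear term taken as n."""
    if n <= z:
        return n
    return 2 * q_radix2(n // 2, z) + n


def q_sqrt(n, z):
    """Recurrence (11.6) for radix sqrt(n), linear term taken as n.

    n must be a perfect square at every level of the recursion above z.
    """
    if n <= z:
        return n
    r = isqrt(n)
    if r * r != n:
        raise ValueError("n must be a perfect square at every level")
    return 2 * r * q_sqrt(r, z) + n


z = 2 ** 8
for n in (z ** 2, z ** 4):
    print(f"n = 2^{n.bit_length() - 1}: "
          f"radix-2 misses / n = {q_radix2(n, z) // n}, "
          f"radix-sqrt(n) misses / n = {q_sqrt(n, z) // n}")
```

With these simplifications, sizes of the form $n = Z^{2^j}$ give exactly $n(1 + \log_2[n/Z])$ for the radix-2 recursion and $n(2\log_Z n - 1)$ for the radix-$\sqrt{n}$ recursion, consistent with (11.5) and with the $\Theta(n \log_Z n)$ solution of (11.6).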


These algorithms illustrate the basic features of most optimal cache-oblivious algorithms: 
they employ a recursive divide-and-conquer strategy to subdivide the problem until it fits 
into cache, at which point the subdivision continues but no further cache misses are required. 
Moreover, a cache-oblivious algorithm exploits all levels of the cache in the same way, so an 
optimal cache-oblivious algorithm exploits a multi-level cache optimally as well as a two-level 
cache [137]: the multi-level “blocking” is implicit in the recursion. 


11.4.2 Cache-obliviousness in practice 


Even though the radix-$\sqrt{n}$ algorithm is optimal cache-oblivious, it does not follow that FFT
implementation is a solved problem. The optimality is only in an asymptotic sense, ignoring 
constant factors, O(n) terms, etcetera, all of which can matter a great deal in practice. For 
small or moderate n, quite different algorithms may be superior, as discussed in "Memory 
strategies in FFTW" (Section 11.4.3: Memory strategies in FFTW). Moreover, real caches 
are inferior to an ideal cache in several ways. The unsurprising consequence of all this is 
that cache-obliviousness, like any complexity-based algorithm property, does not absolve 
one from the ordinary process of software optimization. At best, it reduces the amount of 
memory/cache tuning that one needs to perform, structuring the implementation to make
further optimization easier and more portable. 


Perhaps most importantly, one needs to perform an optimization that has almost nothing to 
do with the caches: the recursion must be “coarsened” to amortize the function-call overhead 
and to enable compiler optimization. For example, the simple pedagogical code of
Listing 11.1 recurses all the way down to n = 1, and hence there are ≈ 2n function calls
in total, so that every data point incurs a two-function-call overhead on average. Moreover,
the compiler cannot fully exploit the large register sets and instruction-level parallelism
of modern processors with an n = 1 function body.⁹ These problems can be effectively
erased, however, simply by making the base cases larger, e.g. the recursion could stop when





⁹ In principle, it might be possible for a compiler to automatically coarsen the recursion, similar to how
compilers can partially unroll loops. We are currently unaware of any general-purpose compiler that performs 
this optimization, however. 
n = 32 is reached, at which point a highly optimized hard-coded FFT of that size would
be executed. In FFTW, we produced this sort of large base-case using a specialized code- 
generation program described in "Generating Small FFT Kernels" (Section 11.6: Generating 
Small FFT Kernels). 
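
The effect of coarsening on the number of function calls is easy to quantify. The following sketch counts the calls made by a radix-2 recursion that stops at a given base-case size; `count_calls` is a hypothetical helper written for this illustration, and it models only the call tree, not the arithmetic.

```python
def count_calls(n, base=1):
    """Function calls made by a radix-2 recursion on size n stopping at `base`.

    n and base are assumed to be powers of two with base <= n.
    """
    if n <= base:
        return 1                                 # one call for the base case
    return 1 + 2 * count_calls(n // 2, base)     # this call plus two halves


n = 1024
print(count_calls(n))           # 2047 calls, i.e. about 2n
print(count_calls(n, base=32))  # 63 calls, i.e. about 2n/32
```

In general the count is 2n/b − 1 for base-case size b, so stopping at size 32 cuts the call overhead by a factor of about 32 before any benefit from optimizing the larger base-case body is considered.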


One might get the impression that there is a strict dichotomy that divides cache-aware 
and cache-oblivious algorithms, but the two are not mutually exclusive in practice. Given 
an implementation of a cache-oblivious strategy, one can further optimize it for the cache 
characteristics of a particular machine in order to improve the constant factors. For example, 
one can tune the radices used, the transition point between the radix-$\sqrt{n}$ algorithm and the
bounded-radix algorithm, or other algorithmic choices as described in "Memory strategies 
in FFTW" (Section 11.4.3: Memory strategies in FFTW). The advantage of starting cache- 
aware tuning with a cache-oblivious approach is that the starting point already exploits all 
levels of the cache to some extent, and one has reason to hope that good performance on one 
machine will be more portable to other architectures than for a purely cache-aware “blocking” 
approach. In practice, we have found this combination to be very successful with FFTW.


11.4.3 Memory strategies in FFTW


The recursive cache-oblivious strategies described above form a useful starting point, but 
FFTW supplements them with a number of additional tricks, and also exploits cache- 
obliviousness in less-obvious forms. 


We currently find that the general radix-$\sqrt{n}$ algorithm is beneficial only when n becomes
very large, on the order of $2^{20} \approx 10^6$. In practice, this means that we use at most a single
step of radix-$\sqrt{n}$ (two steps would only be used for $n \gtrsim 2^{40}$). The reason for this is that
the implementation of radix $\sqrt{n}$ is less efficient than for a bounded radix: the latter has
the advantage that an entire radix butterfly can be performed in hard-coded loop-free code 
within local variables/registers, including the necessary permutations and twiddle factors. 


Thus, for more moderate n, FFTW uses depth-first recursion with a bounded radix, similar
in spirit to the algorithm of Listing 11.1 but with much larger radices (radix 32 is common) and base
cases (size 32 or 64 is common) as produced by the code generator of "Generating Small FFT
Kernels" (Section 11.6: Generating Small FFT Kernels). The self-optimization described in
"Adaptive Composition of FFT Algorithms" (Section 11.5: Adaptive Composition of FFT
Algorithms) allows the choice of radix and the transition to the radix-$\sqrt{n}$ algorithm to be
tuned in a cache-aware (but entirely automatic) fashion.


For small n (including the radix butterflies and the base cases of the recursion), hard-coded 
FFTs (FFTW’s codelets) are employed. However, this gives rise to an interesting problem: 
a codelet for (e.g.) n = 64 is ≈ 2000 lines long, with hundreds of variables and over 1000
arithmetic operations that can be executed in many orders, so what order should be chosen? 
The key problem here is the efficient use of the CPU registers, which essentially form a nearly 
ideal, fully associative cache. Normally, one relies on the compiler for all code scheduling and 
register allocation, but the compiler needs help with such long blocks of code (indeed, the
general register-allocation problem is NP-complete). In particular, FFTW’s generator knows
more about the code than the compiler: the generator knows it is an FFT, and therefore
it can use an optimal cache-oblivious schedule (analogous to the radix-$\sqrt{n}$ algorithm) to
order the code independent of the number of registers [128]. The compiler is then used only
for local “cache-aware” tuning (both for register allocation and the CPU pipeline).¹⁰ As a
practical matter, one consequence of this scheduler is that FFTW’s machine-independent 
codelets are no slower than machine-specific codelets generated by an automated search 
and optimization over many possible codelet implementations, as performed by the SPIRAL 
project [420]. 


(When implementing hard-coded base cases, there is another choice because a loop of small 
transforms is always required. Is it better to implement a hard-coded FFT of size 64, for 
example, or an unrolled loop of four size-16 FFTs, both of which operate on the same amount 
of data? The former should be more efficient because it performs more computations with 
the same amount of data, thanks to the logn factor in the FFT’s nlogn complexity.) 


In addition, there are many other techniques that FFTW employs to supplement the basic
recursive strategy, mainly to address the fact that cache implementations strongly favor ac- 
cessing consecutive data—thanks to cache lines, limited associativity, and direct mapping 
using low-order address bits (accessing data at power-of-two intervals in memory, which is dis- 
tressingly common in FFTs, is thus especially prone to cache-line conflicts). Unfortunately, 
the known FFT algorithms inherently involve some non-consecutive access (whether mixed 
with the computation or in separate bit-reversal/transposition stages). There are many
optimizations in FFTW to address this. For example, the data for several butterflies at a time
can be copied to a small buffer before computing and then copied back, where the copies 
and computations involve more consecutive access than doing the computation directly in- 
place. Or, the input data for the subtransform can be copied from (discontiguous) input to 
(contiguous) output before performing the subtransform in-place (see "Indirect plans" (Sec- 
tion 11.5.2.4: Indirect plans)), rather than performing the subtransform directly out-of-place 
(as in Listing 11.1). Or, the order of loops can be interchanged in order to push the
outermost loop from the first radix step [the $\ell_2$ loop in (11.2)] down to the leaves, in order
to make the input access more consecutive (see "Discussion" (Section 11.5.2.6: Discussion)). 
Or, the twiddle factors can be computed using a smaller look-up table (fewer memory loads) 
at the cost of more arithmetic (see "Numerical Accuracy in FFTs" (Section 11.7: Numerical 
Accuracy in FFTs)). The choice of whether to use any of these techniques, which come 
into play mainly for moderate n ($2^{13} < n < 2^{20}$), is made by the self-optimizing planner as
described in the next section. 





¹⁰ One practical difficulty is that some “optimizing” compilers will tend to greatly re-order the code,
destroying FFTW’s optimal schedule. With GNU gcc, we circumvent this problem by using compiler flags that
explicitly disable certain stages of the optimizer. 
11.5 Adaptive Composition of FFT Algorithms


As alluded to several times already, FFTW implements a wide variety of FFT algorithms 
(mostly rearrangements of Cooley-Tukey) and selects the “best” algorithm for a given n 
automatically. In this section, we describe how such self-optimization is implemented, and 
especially how FFTW’s algorithms are structured as a composition of algorithmic fragments. 
These techniques in FFTW are described in greater detail elsewhere [134], so here we will 
focus only on the essential ideas and the motivations behind them. 


An FFT algorithm in FFTW is a composition of algorithmic steps called a plan. The 
algorithmic steps each solve a certain class of problems (either solving the problem directly 
or recursively breaking it into sub-problems of the same type). The choice of plan for a given 
problem is determined by a planner that selects a composition of steps, either by runtime 
measurements to pick the fastest algorithm, or by heuristics, or by loading a pre-computed 
plan. These three pieces: problems, algorithmic steps, and the planner, are discussed in the 
following subsections. 
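
The idea of a planner that composes steps and chooses among them by measurement can be illustrated with a toy model. The sketch below is emphatically not FFTW's planner or API: here a "plan" is just a tuple of radices applied by a simple decimation-in-time step, the empty plan means "solve directly with a naive DFT", and all function names, the allowed radix set, and the sign convention $\omega_n = e^{-2\pi i/n}$ are assumptions made for this example.

```python
import cmath
import time


def naive_dft(x):
    """O(n^2) DFT used as the 'solve directly' step."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n))
            for k in range(n)]


def execute(plan, x):
    """Apply a plan (tuple of radices); an empty plan solves directly."""
    if not plan:
        return naive_dft(x)
    r, n = plan[0], len(x)
    m = n // r
    # r sub-transforms of size m on the decimated inputs x[j::r].
    subs = [execute(plan[1:], x[j::r]) for j in range(r)]
    y = [0j] * n
    for k1 in range(m):
        # Twiddle the sub-transform outputs, then do a size-r DFT across them.
        t = [subs[j][k1] * cmath.exp(-2j * cmath.pi * j * k1 / n) for j in range(r)]
        for k2 in range(r):
            y[k1 + m * k2] = sum(t[j] * cmath.exp(-2j * cmath.pi * j * k2 / r)
                                 for j in range(r))
    return y


def candidate_plans(n, radices=(2, 4, 8)):
    """Every radix sequence from `radices` that divides n, plus 'direct'."""
    plans = [()]
    for r in radices:
        if n % r == 0 and n > r:
            plans += [(r,) + rest for rest in candidate_plans(n // r, radices)]
    return plans


def make_plan(n, repeats=3):
    """Time every candidate on dummy data and keep the fastest one."""
    x = [complex(i, -i) for i in range(n)]
    best, best_time = None, float("inf")
    for plan in candidate_plans(n):
        start = time.perf_counter()
        for _ in range(repeats):
            execute(plan, x)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = plan, elapsed
    return best
```

A plan returned by `make_plan(n)` can then be reused for many transforms of the same size via `execute(plan, x)`. The winning plan depends on the machine and on timing noise, which is the point of measuring rather than predicting.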


11.5.1 The problem to be solved 


In early versions of FFTW, the only choice made by the planner was the sequence of radices 
[131], and so each step of the plan took a DFT of a given size n, possibly with discontiguous 
input/output, and reduced it (via a radix r) to DFTs of size n/r, which were solved recur- 
sively. That is, each step solved the following problem: given a size n, an input pointer I, 
an input stride ι, an output pointer O, and an output stride o, it computed the DFT
of I[ℓ·ι] for 0 ≤ ℓ < n and stored the result in O[k·o] for 0 ≤ k < n. However, we soon found
that we could not easily express many interesting algorithms within this framework; for ex- 
ample, in-place (I = O) FFTs that do not require a separate bit-reversal stage [195], [375], 
[297], [166]. It became clear that the key issue was not the choice of algorithms, as we had 
first supposed, but the definition of the problem to be solved. Because only problems that 
can be expressed can be solved, the representation of a problem determines an outer bound 
to the space of plans that the planner can explore, and therefore it ultimately constrains 
FFTW’s performance. 


The difficulty with our initial $(n, I, \iota, O, o)$ problem definition was that it forced each
algorithmic step to address only a single DFT. In fact, FFTs break down DFTs into multiple
smaller DFTs, and it is the combination of these smaller transforms that is best addressed 
by many algorithmic choices, especially to rearrange the order of memory accesses between 
the subtransforms. Therefore, we redefined our notion of a problem in FFTW to be not a 
single DFT, but rather a loop of DFTs, and in fact multiple nested loops of DFTs. The 
following sections describe some of the new algorithmic steps that such a problem definition 
enables, but first we will define the problem more precisely. 
DFT problems in FFTW are expressed in terms of structures called I/O tensors,¹¹ which
in turn are described in terms of ancillary structures called I/O dimensions. An I/O
dimension $d$ is a triple $d = (n, \iota, o)$, where $n$ is a non-negative integer called the
length, $\iota$ is an integer called the input stride, and $o$ is an integer called the output
stride. An I/O tensor $t = \{d_1, d_2, \ldots, d_\rho\}$ is a set of I/O dimensions. The
non-negative integer $\rho = |t|$ is called the rank of the I/O tensor. A DFT problem, denoted
by $\mathrm{dft}(N, V, I, O)$, consists of two I/O tensors $N$ and $V$, and of two pointers $I$
and $O$. Informally, this describes $|V|$ nested loops of $|N|$-dimensional DFTs with input
data starting at memory location $I$ and output data starting at $O$.


For simplicity, let us consider only one-dimensional DFTs, so that $N = \{(n, \iota, o)\}$
implies a DFT of length $n$ on input data with stride $\iota$ and output data with stride $o$,
much like in the original FFTW as described above. The main new feature is then the addition of
zero or more “loops” $V$. More formally, $\mathrm{dft}(N, \{(n, \iota, o)\} \cup V, I, O)$ is
recursively defined as a “loop” of $n$ problems: for all $0 \le k < n$, do all computations in
$\mathrm{dft}(N, V, I + k \cdot \iota, O + k \cdot o)$. The
case of multi-dimensional DFTs is defined more precisely elsewhere [134], but essentially each 
I/O dimension in N gives one dimension of the transform. 
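
To make the recursive definition concrete, the following is a minimal Python sketch, not FFTW's actual C data structures. It assumes an I/O dimension is stored as a tuple (n, istride, ostride), that flat lists stand in for memory, and that the forward transform uses the sign convention $\omega_n = e^{-2\pi i/n}$. The names `IODim` and `solve_dft` are hypothetical, and the innermost transform is a naive $O(n^2)$ DFT rather than a fast algorithm.

```python
import cmath
from typing import List, NamedTuple


class IODim(NamedTuple):
    """One I/O dimension (n, istride, ostride); illustrative, not FFTW's type."""
    n: int
    istride: int
    ostride: int


def solve_dft(N: List[IODim], V: List[IODim], src, i0: int, dst, o0: int) -> None:
    """Execute dft(N, V, I, O) for rank |N| <= 1 by direct evaluation.

    src and dst are flat lists standing in for memory; i0 and o0 play the
    role of the pointers I and O.
    """
    if V:
        # Peel off one vector loop and run
        # dft(N, V', I + k*istride, O + k*ostride) for 0 <= k < n.
        (n, istride, ostride), rest = V[0], V[1:]
        for k in range(n):
            solve_dft(N, rest, src, i0 + k * istride, dst, o0 + k * ostride)
        return
    if not N:
        # Rank zero: a "zero-dimensional" DFT copies a single element.
        dst[o0] = src[i0]
        return
    (n, istride, ostride), = N
    # Read all inputs first so that overlapping input/output stays correct.
    x = [src[i0 + l * istride] for l in range(n)]
    for k in range(n):
        dst[o0 + k * ostride] = sum(
            x[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n)
        )
```

For instance, `solve_dft([IODim(3, 1, 1)], [IODim(2, 3, 3)], X, 0, Y, 0)` transforms each row of a 2 × 3 row-major matrix held in `X`. The sketch always peels the first vector loop, whereas the problem definition leaves the loop order unspecified.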


We call N the size of the problem. The rank of a problem is defined to be the rank of 
its size (i.e., the dimensionality of the DFT). Similarly, we call V the vector size of the 
problem, and the vector rank of a problem is correspondingly defined to be the rank of 
its vector size. Intuitively, the vector size can be interpreted as a set of “loops” wrapped 
around a single DFT, and we therefore refer to a single I/O dimension of V as a vector 
loop. (Alternatively, one can view the problem as describing a DFT over a |V|-dimensional 
vector space.) The problem does not specify the order of execution of these loops, however, 
and therefore FFTW is free to choose the fastest or most convenient order. 


11.5.1.1 DFT problem examples 


A more detailed discussion of the space of problems in FFTW can be found in [134], but 
a simple understanding can be gained by examining a few examples demonstrating that 
the I/O tensor representation is sufficiently general to cover many situations that arise in 
practice, including some that are not usually considered to be instances of the DFT. 


A single one-dimensional DFT of length n, with stride-1 input X and output Y, as in (11.1), 
is denoted by the problem $\mathrm{dft}(\{(n, 1, 1)\}, \{\}, X, Y)$ (no loops: vector-rank zero). 


As a more complicated example, suppose we have an $n_1 \times n_2$ matrix $X$ stored as
$n_1$ consecutive blocks of contiguous length-$n_2$ rows (this is called row-major format).
The in-place DFT of all the rows of this matrix would be denoted by the problem
$\mathrm{dft}(\{(n_2, 1, 1)\}, \{(n_1, n_2, n_2)\}, X, X)$: a length-$n_1$ loop of size-$n_2$
contiguous DFTs, where each iteration of the loop offsets its input/output data by a stride
$n_2$. Conversely, the in-place DFT of all the columns of this matrix would be denoted by
$\mathrm{dft}(\{(n_1, n_2, n_2)\}, \{(n_2, 1, 1)\}, X, X)$: compared to the previous example, $N$
and $V$ are swapped. In the latter case, each DFT operates on discontiguous data, and FFTW might
well choose to interchange the loops: instead of performing a loop of DFTs computed
individually, the subtransforms themselves could act on $n_2$-component vectors, as described in
"The space of plans in FFTW" (Section 11.5.2: The space of plans in FFTW).

¹¹ I/O tensors are unrelated to the tensor-product notation used by some other authors to
describe FFT algorithms [389], [296].


A size-1 DFT is simply a copy $Y[0] = X[0]$, and here this can also be denoted by
$N = \{\}$ (rank zero, a “zero-dimensional” DFT). This allows FFTW’s problems to represent
many kinds of copies and permutations of the data within the same problem framework,
which is convenient because these sorts of operations arise frequently in FFT algorithms.
For example, to copy $n$ consecutive numbers from $I$ to $O$, one would use the
rank-zero problem $\mathrm{dft}(\{\}, \{(n, 1, 1)\}, I, O)$. More interestingly, the in-place
transpose of an $n_1 \times n_2$ matrix $X$ stored in row-major format, as described above, is
denoted by $\mathrm{dft}(\{\}, \{(n_1, n_2, 1), (n_2, 1, n_1)\}, X, X)$ (rank zero, vector-rank two).
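
The rank-zero examples can be checked by enumerating which input offset is moved to which output offset. The sketch below is an illustration under stated assumptions: the helper names `rank0_offsets` and `transpose_out_of_place` are hypothetical, I/O dimensions are plain (n, istride, ostride) tuples, and the transpose writes to a separate list so that no element is overwritten before it is read. A true in-place transpose needs a more careful algorithm.

```python
from itertools import product


def rank0_offsets(V):
    """Yield (input_offset, output_offset) for each element moved by dft({}, V, I, O).

    V is a list of (n, istride, ostride) triples; offsets are relative to I and O.
    """
    for idx in product(*(range(n) for n, _, _ in V)):
        yield (sum(k * i for k, (_, i, _) in zip(idx, V)),
               sum(k * o for k, (_, _, o) in zip(idx, V)))


def transpose_out_of_place(X, n1, n2):
    """Apply the index map of dft({}, {(n1, n2, 1), (n2, 1, n1)}, X, Y) into a new list Y."""
    Y = [None] * (n1 * n2)
    for src, dst in rank0_offsets([(n1, n2, 1), (n2, 1, n1)]):
        Y[dst] = X[src]
    return Y
```

With `V = [(n, 1, 1)]` the offsets are simply (0, 0), (1, 1), and so on, which is the contiguous copy; with the two-dimensional tensor above, element (i, j) of the row-major input lands at position j·n1 + i, which is the transpose.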


11.5.2 The space of plans in FFTW 


Here, we describe a subset of the possible plans considered by FFTW; while not exhaustive 
[134], this subset is enough to illustrate the basic structure of FFTW and the necessity of 
including the vector loop(s) in the problem definition to enable several interesting algorithms. 
The plans that we now describe usually perform some simple “atomic” operation, and it may 
not be apparent how these operations fit together to actually compute DFTs, or why certain 
operations are useful at all. We shall discuss those matters in "Discussion" (Section 11.5.2.6: 
Discussion). 


Roughly speaking, to solve a general DFT problem, one must perform three tasks. First, one 
must reduce a problem of arbitrary vector rank to a set of loops nested around a problem 
of vector rank 0, i.e., a single (possibly multi-dimensional) DFT. Second, one must reduce 
the multi-dimensional DFT to a sequence of rank-1 problems, i.e., one-dimensional DFTs; 
for simplicity, however, we do not consider multi-dimensional DFTs below. Third, one must 
solve the rank-1, vector rank-0 problem by means of some DFT algorithm such as Cooley- 
Tukey. These three steps need not be executed in the stated order, however, and in fact, 
almost every permutation and interleaving of these three steps leads to a correct DFT plan. 
The choice of the set of plans explored by the planner is critical for the usability of the 
FFTW system: the set must be large enough to contain the fastest possible plans, but it 
must be small enough to keep the planning time acceptable. 


11.5.2.1 Rank-0 plans 


The rank-0 problem dft ({}, V, I, O) denotes a permutation of the input array into the output 
array. FFTW does not solve arbitrary rank-0 problems, only the following two special cases 
that arise in practice. 
• When $|V| = 1$ and $I \neq O$, FFTW produces a plan that copies the input array into the
  output array. Depending on the strides, the plan consists of a loop or, possibly, of a
  call to the ANSI C function memcpy, which is specialized to copy contiguous regions of
  memory.

• When $|V| = 2$, $I = O$, and the strides denote a matrix-transposition problem, FFTW
  creates a plan that transposes the array in-place. FFTW implements the square
  transposition $\mathrm{dft}(\{\}, \{(n, \iota, o), (n, o, \iota)\}, I, O)$ by means of the
  cache-oblivious algorithm from [137], which is fast and, in theory, uses the cache optimally
  regardless of the cache size (using principles similar to those described in the section
  "FFTs and the Memory Hierarchy" (Section 11.4: FFTs and the Memory Hierarchy)). A
  generalization of this idea is employed for non-square transpositions with a large common
  factor or a small difference between the dimensions, adapting algorithms from [100].


11.5.2.2 Rank-1 plans 


Rank-1 DFT problems denote ordinary one-dimensional Fourier transforms. FFTW deals 
with most rank-1 problems as follows. 


11.5.2.2.1 Direct plans 


When the DFT rank-1 problem is “small enough” (usually, $n \le 64$), FFTW produces a direct
plan that solves the problem directly. These plans operate by calling a fragment of C code (a
codelet) specialized to solve problems of one particular size, whose generation is described
in "Generating Small FFT Kernels" (Section 11.6: Generating Small FFT Kernels). More
precisely, the codelets compute a loop ($|V| \le 1$) of small DFTs.


11.5.2.2.2 Cooley-Tukey plans 


For problems of the form $\mathrm{dft}(\{(n, \iota, o)\}, V, I, O)$ where $n = rm$, FFTW generates
a plan that implements a radix-$r$ Cooley-Tukey algorithm "Review of the Cooley-Tukey FFT"
(Section 11.2: Review of the Cooley-Tukey FFT). Both decimation-in-time and
decimation-in-frequency plans are supported, with both small fixed radices (usually,
$r \le 64$) produced by the codelet generator "Generating Small FFT Kernels" (Section 11.6:
Generating Small FFT Kernels) and also arbitrary radices (e.g. radix-$\sqrt{n}$).


The most common case is a decimation in time (DIT) plan, corresponding to a radix $r = n_2$
(and thus $m = n_1$) in the notation of "Review of the Cooley-Tukey FFT" (Section 11.2:
Review of the Cooley-Tukey FFT): it first solves
$\mathrm{dft}(\{(m, r \cdot \iota, o)\}, V \cup \{(r, \iota, m \cdot o)\}, I, O)$,
then multiplies the output array $O$ by the twiddle factors, and finally solves
$\mathrm{dft}(\{(r, m \cdot o, m \cdot o)\}, V \cup \{(m, o, o)\}, O, O)$. For performance, the
last two steps are not planned independently, but are fused together in a single “twiddle”
codelet: a fragment of C code that multiplies its input by the twiddle factors and performs a
DFT of size $r$, operating in-place on $O$.
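
The three stages of a DIT plan can be traced in a short Python sketch. It is an illustration under simplifying assumptions, not FFTW's implementation: the data are contiguous (both strides equal to 1), $V$ is empty, the sub-transforms are evaluated by a naive DFT instead of being planned recursively, and the twiddle multiplication is kept as a separate pass rather than fused into a codelet. The function names are hypothetical, and the sign convention $\omega_n = e^{-2\pi i/n}$ is assumed.

```python
import cmath


def naive_dft(x):
    """Reference O(n^2) DFT with omega_n = exp(-2*pi*i/n)."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n))
            for k in range(n)]


def dit_step(x, r):
    """One radix-r decimation-in-time step on contiguous data of length n = r*m."""
    n = len(x)
    if n % r != 0:
        raise ValueError("radix must divide the transform size")
    m = n // r
    out = [0j] * n
    # Stage 1: a loop of r DFTs of size m. Transform j reads its input with
    # stride r (starting at offset j) and writes a contiguous block at j*m.
    for j in range(r):
        out[j * m:(j + 1) * m] = naive_dft(x[j::r])
    # Stage 2: multiply the output array by the twiddle factors omega_n^(j*k).
    for j in range(r):
        for k in range(m):
            out[j * m + k] *= cmath.exp(-2j * cmath.pi * j * k / n)
    # Stage 3: a loop of m in-place DFTs of size r with stride m.
    for k in range(m):
        out[k::m] = naive_dft(out[k::m])
    return out
```

Calling `dit_step(x, 3)` on a length-12 list should agree with `naive_dft(x)` up to roundoff, which is a convenient way to check the stride bookkeeping.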
11.5.2.3 Plans for higher vector ranks 


These plans extract a vector loop to reduce a DFT problem to a problem of lower vector 
rank, which is then solved recursively. Any of the vector loops of V could be extracted in 
this way, leading to a number of possible plans corresponding to different loop orderings. 


Formally, to solve $\mathrm{dft}(N, V, I, O)$, where $V = \{(n, \iota, o)\} \cup V_1$, FFTW
generates a loop that, for all $k$ such that $0 \le k < n$, invokes a plan for
$\mathrm{dft}(N, V_1, I + k \cdot \iota, O + k \cdot o)$.


11.5.2.4 Indirect plans 


Indirect plans transform a DFT problem that requires some data shuffling (or discontiguous 
operation) into a problem that requires no shuffling plus a rank-0 problem that performs the 
shuffling. 


Formally, to solve $\mathrm{dft}(N, V, I, O)$ where $|N| > 0$, FFTW generates a plan that first
solves $\mathrm{dft}(\{\}, N \cup V, I, O)$, and then solves
$\mathrm{dft}(\text{copy-o}(N), \text{copy-o}(V), O, O)$. Here we define $\text{copy-o}(t)$ to be
the I/O tensor $\{(n, o, o) \mid (n, \iota, o) \in t\}$: that is, it replaces the input
strides with the output strides. Thus, an indirect plan first rearranges/copies the data to 
the output, then solves the problem in place. 
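
As a small illustration of this rule, the sketch below builds the two sub-problems of an indirect plan from the tensors of the original problem. It assumes I/O tensors are lists of (n, istride, ostride) tuples; `copy_o` mirrors the copy-o operation defined above, while `indirect_subproblems` is a hypothetical helper name.

```python
def copy_o(t):
    """Replace each input stride by the output stride: {(n, o, o) | (n, i, o) in t}."""
    return [(n, o, o) for (n, _i, o) in t]


def indirect_subproblems(N, V):
    """Split dft(N, V, I, O) into the two problems solved by an indirect plan.

    Returns (shuffle, in_place), each a (size_tensor, vector_tensor) pair:
    shuffle is the rank-0 problem dft({}, N u V, I, O), and in_place is
    dft(copy-o(N), copy-o(V), O, O).
    """
    if not N:
        raise ValueError("indirect plans apply only when |N| > 0")
    shuffle = ([], list(N) + list(V))
    in_place = (copy_o(N), copy_o(V))
    return shuffle, in_place
```

For example, a size-4 transform with input stride 2 and output stride 1, looped twice, becomes a rank-0 shuffle over both dimensions followed by an in-place problem whose strides are all output strides.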


11.5.2.5 Plans for prime sizes 


As discussed in "Goals and Background of the FFTW Project" (Section 11.3: Goals and 
Background of the FFTW Project), it turns out to be surprisingly useful to be able to 
handle large prime n (or large prime factors). Rader plans implement the algorithm from 
[309] to compute one-dimensional DFTs of prime size in $\Theta(n \log n)$ time. Bluestein plans
implement Bluestein’s “chirp-z” algorithm, which can also handle prime $n$ in $O(n \log n)$ time
[35], [305], [278]. Generic plans implement a naive $O(n^2)$ algorithm (useful for $n \lesssim 100$).
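
The re-indexing behind Bluestein's algorithm can be sketched in a few lines. It relies on the identity $jk = (j^2 + k^2 - (k - j)^2)/2$, which turns the DFT sum into a convolution with a “chirp” sequence. The sketch below is only an illustration: it evaluates that convolution directly, so it still costs $O(n^2)$; reaching $O(n \log n)$ requires computing the convolution with zero-padded FFTs of a convenient length, which is omitted here. The function names are hypothetical, and $\omega_n = e^{-2\pi i/n}$ is assumed.

```python
import cmath


def naive_dft(x):
    """Reference O(n^2) DFT with omega_n = exp(-2*pi*i/n)."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n))
            for k in range(n)]


def bluestein_dft(x):
    """DFT of arbitrary (e.g. prime) length expressed as a chirp convolution."""
    n = len(x)
    # chirp[k] = exp(-i*pi*k^2/n); k^2 is reduced mod 2n to keep the angle small.
    chirp = [cmath.exp(-1j * cmath.pi * ((k * k) % (2 * n)) / n) for k in range(n)]
    # Pre-multiply the input by the chirp.
    a = [x[j] * chirp[j] for j in range(n)]
    # Convolve with the conjugate chirp (which is even in its index), then
    # post-multiply by the chirp again.
    return [chirp[k] * sum(a[j] * chirp[abs(k - j)].conjugate() for j in range(n))
            for k in range(n)]
```

Comparing `bluestein_dft` with `naive_dft` on a prime length such as 7 is a quick check that the chirp factors have the right signs.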


11.5.2.6 Discussion 


Although it may not be immediately apparent, the combination of the recursive rules in 
"The space of plans in FFTW" (Section 11.5.2: The space of plans in FFTW) can produce 
a number of useful algorithms. To illustrate these compositions, we discuss three particular 
issues: depth- vs. breadth-first, loop reordering, and in-place transforms. 
size-30 DFT, depth-first:
    loop 3
        { size-5 direct codelet, vector size 2
        { size-2 twiddle codelet, vector size 5
    size-3 twiddle codelet, vector size 10

size-30 DFT, breadth-first:
    { loop 3
    {     size-5 direct codelet, vector size 2
    { loop 3
    {     size-2 twiddle codelet, vector size 5
    size-3 twiddle codelet, vector size 10

Figure 11.3: Two possible decompositions for a size-30 DFT, both for the arbitrary
choice of DIT radices 3 then 2 then 5, and prime-size codelets. Items grouped by a "{"
result from the plan for a single sub-problem. In the depth-first case, the vector rank was
reduced to zero as per "Plans for higher vector ranks" (Section 11.5.2.3: Plans for higher
vector ranks) before decomposing sub-problems, and vice-versa in the breadth-first case.





As discussed previously in sections "Review of the Cooley-Tukey FFT" (Section 11.2: Review 
of the Cooley-Tukey FFT) and "Understanding FFTs with an ideal cache" (Section 11.4.1: 
Understanding FFTs with an ideal cache), the same Cooley-Tukey decomposition can be 
executed in either traditional breadth-first order or in recursive depth-first order, where the 
latter has some theoretical cache advantages. FFTW is explicitly recursive, and thus it 
can naturally employ a depth-first order. Because its sub-problems contain a vector loop 
that can be executed in a variety of orders, however, FFTW can also employ breadth-first 
traversal. In particular, a 1d algorithm resembling the traditional breadth-first Cooley-Tukey 
would result from applying "Cooley-Tukey plans" (Section 11.5.2.2.2: Cooley-Tukey plans) 
to completely factorize the problem size before applying the loop rule "Plans for higher 
vector ranks" (Section 11.5.2.3: Plans for higher vector ranks) to reduce the vector ranks, 
whereas depth-first traversal would result from applying the loop rule before factorizing each 
subtransform. These two possibilities are illustrated by an example in Figure 11.3. 


Another example of the effect of loop reordering is a style of plan that we sometimes call 
vector recursion (unrelated to “vector-radix” FFTs [114]). The basic idea is that, if one has 
a loop (vector-rank 1) of transforms, where the vector stride is smaller than the transform 
size, it is advantageous to push the loop towards the leaves of the transform decomposition, 
while otherwise maintaining recursive depth-first ordering, rather than looping “outside” 
the transform; i.e., apply the usual FFT to “vectors” rather than numbers. Limited forms 
of this idea have appeared for computing multiple FFTs on vector processors (where the 
loop in question maps directly to a hardware vector) [372]. For example, Cooley-Tukey 
produces a unit input-stride vector loop at the top-level DIT decomposition, but with a 
large output stride; this difference in strides makes it non-obvious whether vector recursion 
is advantageous for the sub-problem, but for large transforms we often observe the planner 
to choose this possibility. 


In-place 1d transforms (with no separate bit reversal pass) can be obtained as follows by a
combination of DIT and DIF plans "Cooley-Tukey plans" (Section 11.5.2.2.2: Cooley-Tukey
plans) with transposes "Rank-0 plans" (Section 11.5.2.1: Rank-0 plans). First, the transform
is decomposed via a radix-$p$ DIT plan into a vector of $p$ transforms of size $qm$, then these
are decomposed in turn by a radix-$q$ DIF plan into a vector (rank 2) of $p \times q$ transforms
of size $m$. These transforms of size $m$ have input and output at different places/strides in
the original array, and so cannot be solved independently. Instead, an indirect plan "Indirect
plans" (Section 11.5.2.4: Indirect plans) is used to express the sub-problem as $pq$ in-place
transforms of size $m$, followed or preceded by an $m \times p \times q$ rank-0 transform. The
latter sub-problem is easily seen to be $m$ in-place $p \times q$ transposes (ideally square,
i.e. $p = q$).
Related strategies for in-place transforms based on small transposes were described in [195], 
[375], [297], [166]; alternating DIT/DIF, without concern for in-place operation, was also 
considered in [255], [322]. 


11.5.3 The FFTW planner 


Given a problem and a set of possible plans, the basic principle behind the FFTW planner 
is straightforward: construct a plan for each applicable algorithmic step, time the execution 
of these plans, and select the fastest one. Each algorithmic step may break the problem 
into subproblems, and the fastest plan for each subproblem is constructed in the same way. 
These timing measurements can either be performed at runtime, or alternatively the plans 
for a given set of sizes can be precomputed and loaded at a later time. 


A direct implementation of this approach, however, faces an exponential explosion of the 
number of possible plans, and hence of the planning time, as n increases. In order to 
reduce the planning time to a manageable level, we employ several heuristics to reduce the 
space of possible plans that must be compared. The most important of these heuristics is 
dynamic programming [96]: it optimizes each sub-problem locally, independently of the 
larger context (so that the “best” plan for a given sub-problem is re-used whenever that sub- 
problem is encountered). Dynamic programming is not guaranteed to find the fastest plan, 
because the performance of plans is context-dependent on real machines (e.g., the contents 
of the cache depend on the preceding computations); however, this approximation works 
reasonably well in practice and greatly reduces the planning time. Other approximations, 
such as restrictions on the types of loop-reorderings that are considered "Plans for higher 
vector ranks" (Section 11.5.2.3: Plans for higher vector ranks), are described in [134]. 
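
The dynamic-programming idea can be illustrated with a toy planner. This is a sketch under loud assumptions: it considers only direct codelets and radix-$r$ DIT steps for a single transform size, and it replaces FFTW's runtime measurements with the made-up cost functions `direct_cost` and `twiddle_cost`, which do not model any real machine. All names are hypothetical. The memoization is the essential point: the best plan for each sub-problem size is computed once and reused wherever that size recurs.

```python
from functools import lru_cache

MAX_CODELET = 64  # assumed largest size handled by a single codelet


def direct_cost(n):
    """Hypothetical cost of a size-n direct codelet (stand-in for a measured time)."""
    return n * n


def twiddle_cost(r, m):
    """Hypothetical cost of m radix-r twiddle butterflies (stand-in for a measured time)."""
    return m * (r * r + r)


@lru_cache(maxsize=None)
def best_plan(n):
    """Return (cost, plan) for a size-n DFT; plan is a nested tuple."""
    candidates = []
    if n <= MAX_CODELET:
        candidates.append((direct_cost(n), ("direct", n)))
    for r in range(2, min(n, MAX_CODELET + 1)):
        if n % r == 0:
            m = n // r
            sub_cost, sub_plan = best_plan(m)  # memoized: solved once per size
            cost = r * sub_cost + twiddle_cost(r, m)
            candidates.append((cost, ("dit", r, sub_plan)))
    if not candidates:
        # Large primes would need a Rader or Bluestein step, omitted here.
        raise ValueError(f"no applicable step for n = {n}")
    return min(candidates, key=lambda c: c[0])


def plan_size(plan):
    """Recover the transform size described by a plan (for sanity checks)."""
    if plan[0] == "direct":
        return plan[1]
    _, r, sub_plan = plan
    return r * plan_size(sub_plan)
```

As in the real planner, choosing each sub-plan independently of its context is an approximation; with measured timings, the locally fastest sub-plan need not be the best one inside every larger plan.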


Alternatively, there is an estimate mode that performs no timing measurements whatso- 
ever, but instead minimizes a heuristic cost function. This can reduce the planner time by 
several orders of magnitude, but with a significant penalty observed in plan efficiency; e.g., 
a penalty of 20% is typical for moderate $n \lesssim 2^{13}$, whereas a factor of 2-3 can be
suffered for large $n \gtrsim 2^{16}$ [134]. Coming up with a better heuristic plan is an
interesting open research
question; one difficulty is that, because FFT algorithms depend on factorization, knowing a 
good plan for n does not immediately help one find a good plan for nearby n. 


11.6 Generating Small FFT Kernels 


The base cases of FFTW’s recursive plans are its codelets, and these form a critical compo- 
nent of FFTW’s performance. They consist of long blocks of highly optimized, straight-line 
code, implementing many special cases of the DFT that give the planner a large space of 
plans in which to optimize. Not only was it impractical to write numerous codelets by hand, 
but we also needed to rewrite them many times in order to explore different algorithms and 
optimizations. Thus, we designed a special-purpose “FFT compiler” called genfft that pro- 
duces the codelets automatically from an abstract description. genfft is summarized in this 
section and described in more detail by [128]. 


A typical codelet in FFTW computes a DFT of a small, fixed size $n$ (usually, $n \le 64$), 
possibly with the input or output multiplied by twiddle factors "Cooley-Tukey plans" (Sec- 
tion 11.5.2.2.2: Cooley-Tukey plans). Several other kinds of codelets can be produced by 
genfft, but we will focus here on this common case. 


In principle, all codelets implement some combination of the Cooley-Tukey algorithm from 
(11.2) and/or some other DFT algorithm expressed by a similarly compact formula. However, 
a high-performance implementation of the DFT must address many more concerns than 
(11.2) alone suggests. For example, (11.2) contains multiplications by 1 that are more 
efficient to omit. (11.2) entails a run-time factorization of n, which can be precomputed 
if n is known in advance. (11.2) operates on complex numbers, but breaking the complex- 
number abstraction into real and imaginary components turns out to expose certain non- 
obvious optimizations. Additionally, to exploit the long pipelines in current processors, the 
recursion implicit in (11.2) should be unrolled and re-ordered to a significant degree. Many 
further optimizations are possible if the complex input is known in advance to be purely 
real (or imaginary). Our design goal for genfft was to keep the expression of the DFT 
algorithm independent of such concerns. This separation allowed us to experiment with 
various DFT algorithms and implementation strategies independently and without (much) 
tedious rewriting. 


genfft is structured as a compiler whose input consists of the kind and size of the desired 
codelet, and whose output is C code. genfft operates in four phases: creation, simplification, 
scheduling, and unparsing. 


In the creation phase, genfft produces a representation of the codelet in the form of a di- 
rected acyclic graph (dag). The dag is produced according to well-known DFT algorithms: 
Cooley-Tukey (11.2), prime-factor [278], split-radix [422], [107], [391], [230], [114], and Rader 
[309]. Each algorithm is expressed in a straightforward math-like notation, using complex 
numbers, with no attempt at optimization. Unlike a normal FFT implementation, however, 
the algorithms here are evaluated symbolically and the resulting symbolic expression is rep- 
resented as a dag, and in particular it can be viewed as a linear network [98] (in which 
the edges represent multiplication by constants and the vertices represent additions of the 
incoming edges). 


In the simplification phase, genfft applies local rewriting rules to each node of the dag 
in order to simplify it. This phase performs algebraic transformations (such as eliminating 
multiplications by 1) and common-subexpression elimination. Although such transforma- 
tions can be performed by a conventional compiler to some degree, they can be carried 
out here to a greater extent because genfft can exploit the specific problem domain. For 
example, two equivalent subexpressions can always be detected, even if the subexpressions 
are written in algebraically different forms, because all subexpressions compute linear func- 
tions. Also, genfft can exploit the property that network transposition (reversing the 
direction of every edge) computes the transposed linear operation [98], in order to transpose 
the network, simplify, and then transpose back—this turns out to expose additional com- 
mon subexpressions [128]. In total, these simplifications are sufficiently powerful to derive 
DFT algorithms specialized for real and/or symmetric data automatically from the complex 
algorithms. For example, it is known that when the input of a DFT is real (and the output 
is hence conjugate-symmetric), one can save a little over a factor of two in arithmetic cost 
by specializing FFT algorithms for this case; with genfft, this specialization can be done 
entirely automatically, pruning the redundant operations from the dag, to match the lowest 
known operation count for a real-input FFT starting only from the complex-data algorithm 
[128], [202]. We take advantage of this property to help us implement real-data DFTs [128], 
[134], to exploit machine-specific “SIMD” instructions "SIMD instructions" (Section 11.6.1: 
SIMD instructions) [134], and to generate codelets for the discrete cosine (DCT) and sine 
(DST) transforms [128], [202]. Furthermore, by experimentation we have discovered addi- 
tional simplifications that improve the speed of the generated code. One interesting example 
is the elimination of negative constants [128]: multiplicative constants in FFT algorithms 
often come in positive/negative pairs, but every C compiler we are aware of will generate 
separate load instructions for positive and negative versions of the same constants.¹² We 
thus obtained a 10-15% speedup by making all constants positive, which involves propagat- 
ing minus signs to change additions into subtractions or vice versa elsewhere in the dag (a 
daunting task if it had to be done manually for tens of thousands of lines of code). 


In the scheduling phase, genfft produces a topological sort of the dag (a schedule). The
goal of this phase is to find a schedule such that a C compiler can subsequently perform a
good register allocation. The scheduling algorithm used by genfft offers certain theoretical
guarantees because it has its foundations in the theory of cache-oblivious algorithms [137]
(here, the registers are viewed as a form of cache), as described in "Memory strategies in
FFTW" (Section 11.4.3: Memory strategies in FFTW). As a practical matter, one consequence
of this scheduler is that FFTW’s machine-independent codelets are no slower than
machine-specific codelets generated by SPIRAL [420].

¹² Floating-point constants must be stored explicitly in memory; they cannot be embedded
directly into the CPU instructions like integer “immediate” constants.


In the stock genfft implementation, the schedule is finally unparsed to C. A variation from 
[127] implements the rest of a compiler back end and outputs assembly code. 


11.6.1 SIMD instructions 


Unfortunately, it is impossible to attain nearly peak performance on current popular pro- 
cessors while using only portable C code. Instead, a significant portion of the available 
computing power can only be accessed by using specialized SIMD (single-instruction multi- 
ple data) instructions, which perform the same operation in parallel on a data vector. For 
example, all modern “x86” processors can execute arithmetic instructions on “vectors” of four 
single-precision values (SSE instructions) or two double-precision values (SSE2 instructions) 
at a time, assuming that the operands are arranged consecutively in memory and satisfy 
a 16-byte alignment constraint. Fortunately, because nearly all of FFTW’s low-level code 
is produced by genfft, machine-specific instructions could be exploited by modifying the 
generator—the improvements are then automatically propagated to all of FFTW’s codelets, 
and in particular are not limited to a small set of sizes such as powers of two. 


SIMD instructions are superficially similar to “vector processors”, which are designed to
perform the same operation in parallel on all elements of a data array (a “vector”). The
performance of “traditional” vector processors was best for long vectors that are stored in 
contiguous memory locations, and special algorithms were developed to implement the DFT 
efficiently on this kind of hardware [372], [166]. Unlike in vector processors, however, the 
SIMD vector length is small and fixed (usually 2 or 4). Because microprocessors depend 
on caches for performance, one cannot naively use SIMD instructions to simulate a long- 
vector algorithm: while on vector machines long vectors generally yield better performance, 
the performance of a microprocessor drops as soon as the data vectors exceed the capac- 
ity of the cache. Consequently, SIMD instructions are better seen as a restricted form of 
instruction-level parallelism than as a degenerate flavor of vector parallelism, and different 
DFT algorithms are required. 


The technique used to exploit SIMD instructions in genfft is most easily understood for 
vectors of length two (e.g., SSE2). In this case, we view a complex DFT as a pair of real 
DFTs: 


$$\mathrm{DFT}(A + i \cdot B) = \mathrm{DFT}(A) + i \cdot \mathrm{DFT}(B), \qquad (11.7)$$


where A and B are two real arrays. Our algorithm computes the two real DFTs in parallel 
using SIMD instructions, and then it combines the two outputs according to (11.7). This 
SIMD algorithm has two important properties. First, if the data is stored as an array of 
complex numbers, as opposed to two separate real and imaginary arrays, the SIMD loads 
and stores always operate on correctly-aligned contiguous locations, even if the complex 
numbers themselves have a non-unit stride. Second, because the algorithm finds two-way 
parallelism in the real and imaginary parts of a single DFT (as opposed to performing two 
DFTs in parallel), we can completely parallelize DFTs of any size, not just even sizes or 
powers of 2. 
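
The identity (11.7) is just the linearity of the DFT, and it can be checked numerically. The Python sketch below is an illustration only: it uses no SIMD instructions, the function names are hypothetical, and the two real-input transforms that a SIMD implementation would run in lockstep are computed here one after the other by a naive DFT.

```python
import cmath


def naive_dft(x):
    """Reference O(n^2) DFT with omega_n = exp(-2*pi*i/n); accepts real or complex input."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n))
            for k in range(n)]


def dft_via_real_pair(z):
    """Compute DFT(A + i*B) as DFT(A) + i*DFT(B), where A and B are real arrays."""
    A = [v.real for v in z]
    B = [v.imag for v in z]
    # In a two-way SIMD implementation these two transforms execute the same
    # instruction stream, one in each vector lane.
    FA, FB = naive_dft(A), naive_dft(B)
    return [fa + 1j * fb for fa, fb in zip(FA, FB)]
```

Because the split is over real and imaginary parts rather than over two separate transforms, the same check works for any length, including odd sizes such as 5.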


11.7 Numerical Accuracy in FFTs 


An important consideration in the implementation of any practical numerical algorithm is 
numerical accuracy: how quickly do floating-point roundoff errors accumulate in the course 
of the computation? Fortunately, FFT algorithms for the most part have remarkably good 
accuracy characteristics. In particular, for a DFT of length n computed by a Cooley-Tukey 
algorithm with finite-precision floating-point arithmetic, the worst-case error growth is 
$O(\log n)$ [139], [373] and the mean error growth for random inputs is only $O(\sqrt{\log n})$ [326], 
[373]. This is so good that, in practical applications, a properly implemented FFT will rarely 
be a significant contributor to the numerical error. 


The amazingly small roundoff errors of FFT algorithms are sometimes explained incorrectly 
as simply a consequence of the reduced number of operations: since there are fewer operations 
compared to a naive $O(n^2)$ algorithm, the argument goes, there is less accumulation of 
roundoff error. The real reason, however, is more subtle than that, and has to do with the 
ordering of the operations rather than their number. For example, consider the computation 
of only the output Y [0] in the radix-2 algorithm of p. ??, ignoring all of the other outputs of 
the FFT. $Y[0]$ is the sum of all of the inputs, requiring $n - 1$ additions. The FFT does not 
change this requirement, it merely changes the order of the additions so as to re-use some 
of them for other outputs. In particular, this radix-2 DIT FFT computes Y [0] as follows: 
it first sums the even-indexed inputs, then sums the odd-indexed inputs, then adds the two 
sums; the even- and odd-indexed inputs are summed recursively by the same procedure. 
This process is sometimes called cascade summation, and even though it still requires 
$n - 1$ total additions to compute $Y[0]$ by itself, its roundoff error grows much more slowly
than simply adding $X[0]$, $X[1]$, $X[2]$ and so on in sequence. Specifically, the roundoff error
when adding up $n$ floating-point numbers in sequence grows as $O(n)$ in the worst case,
or as $O(\sqrt{n})$ on average for random inputs (where the errors grow according to a random
walk), but simply reordering these $n - 1$ additions into a cascade summation yields
$O(\log n)$ worst-case and $O(\sqrt{\log n})$ average-case error growth [182].
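
The difference between the two summation orders is easy to observe. The following Python sketch compares in-sequence summation with a cascade summation that, like the radix-2 DIT computation of $Y[0]$, sums the even-indexed and odd-indexed entries recursively. It is an illustration, with hypothetical function names; `math.fsum` from the standard library serves as the accurately rounded reference.

```python
import math


def sequential_sum(x):
    """Add the entries one after another, left to right."""
    total = 0.0
    for v in x:
        total += v
    return total


def cascade_sum(x):
    """Sum the even-indexed and odd-indexed entries recursively, then add the two sums."""
    if len(x) == 0:
        return 0.0
    if len(x) == 1:
        return x[0]
    return cascade_sum(x[0::2]) + cascade_sum(x[1::2])


def summation_errors(x):
    """Return (sequential error, cascade error) relative to math.fsum."""
    exact = math.fsum(x)
    return abs(sequential_sum(x) - exact), abs(cascade_sum(x) - exact)
```

For example, `summation_errors([0.1] * 100000)` should show a visibly smaller error for the cascade order. Both functions perform the same number of additions, so the improvement comes purely from the ordering.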


However, these encouraging error-growth rates only apply if the trigonometric “twiddle” 
factors in the FFT algorithm are computed very accurately. Many FFT implementations, 
including FFTW and common manufacturer-optimized libraries, therefore use precomputed 
tables of twiddle factors calculated by means of standard library functions (which compute 
trigonometric constants to roughly machine precision). The other common method to com- 
pute twiddle factors is to use a trigonometric recurrence formula—this saves memory (and 
cache), but almost all recurrences have errors that grow as $O(\sqrt{n})$, $O(n)$, or even $O(n^2)$
[374], which lead to corresponding errors in the FFT. For example, one simple recurrence is
$e^{i(k+1)\theta} = e^{ik\theta} e^{i\theta}$, multiplying repeatedly by $e^{i\theta}$ to obtain a sequence of equally spaced angles,
but the errors when using this process grow as $O(n)$ [374]. A common improved recurrence
is $e^{i(k+1)\theta} = e^{ik\theta} + e^{ik\theta}\left(e^{i\theta} - 1\right)$, where the small quantity$^{13}$ $e^{i\theta} - 1 = \cos(\theta) - 1 + i\sin(\theta)$ is
computed using $\cos(\theta) - 1 = -2\sin^2(\theta/2)$ [341]; unfortunately, the error using this method
still grows as $O(\sqrt{n})$ [374], far worse than logarithmic.
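The two recurrences can be compared against twiddle factors computed directly from library functions. This is a minimal sketch with made-up function names; the sign convention $\theta = 2\pi/n$ and the size $n = 4096$ are assumptions chosen only for illustration.

```python
import cmath
import math

def twiddles_direct(n):
    """Reference values: each factor computed from library functions."""
    return [cmath.exp(2j * math.pi * k / n) for k in range(n)]

def twiddles_simple(n):
    """Simple recurrence: w[k+1] = w[k] * e^{i theta}."""
    step = cmath.exp(2j * math.pi / n)
    w = [1 + 0j]
    for _ in range(n - 1):
        w.append(w[-1] * step)
    return w

def twiddles_improved(n):
    """Improved recurrence: w[k+1] = w[k] + w[k] * (e^{i theta} - 1),
    with cos(theta) - 1 evaluated as -2 sin^2(theta / 2)."""
    theta = 2 * math.pi / n
    delta = complex(-2.0 * math.sin(theta / 2) ** 2, math.sin(theta))
    w = [1 + 0j]
    for _ in range(n - 1):
        w.append(w[-1] + w[-1] * delta)
    return w

def max_error(approx, exact):
    return max(abs(a - b) for a, b in zip(approx, exact))

n = 4096
ref = twiddles_direct(n)
err_simple = max_error(twiddles_simple(n), ref)
err_improved = max_error(twiddles_improved(n), ref)
print(err_simple, err_improved)
```

A single run at one size does not demonstrate the asymptotic growth rates; repeating the measurement over a range of $n$ would be needed for that.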


There are, in fact, trigonometric recurrences with the same logarithmic error growth as the
FFT, but these seem more difficult to implement efficiently; they require that a table of
$\Theta(\log n)$ values be stored and updated as the recurrence progresses [42], [374]. Instead, in
order to gain at least some of the benefits of a trigonometric recurrence (reduced memory
pressure at the expense of more arithmetic), FFTW includes several ways to compute a much
smaller twiddle table, from which the desired entries can be computed accurately on the fly
using a bounded number (usually $< 3$) of complex multiplications. For example, instead of
a twiddle table with $n$ entries $\omega_n^k$, FFTW can use two tables with $\Theta(\sqrt{n})$ entries each, so
that $\omega_n^k$ is computed by multiplying an entry in one table (indexed with the low-order bits
of $k$) by an entry in the other table (indexed with the high-order bits of $k$).
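The two-table idea can be sketched in a few lines. This illustrates the indexing scheme only and is not FFTW's actual code; the helper names, the choice $\omega_n = e^{-2\pi i/n}$, and the sizes are assumptions.

```python
import cmath
import math

def make_tables(n, s):
    """Two small tables for omega = exp(-2*pi*i/n):
    low[j] = omega^j for j < s, high[j] = omega^(s*j)."""
    def omega(k):
        return cmath.exp(-2j * math.pi * k / n)
    low = [omega(j) for j in range(s)]
    high = [omega(s * j) for j in range((n + s - 1) // s)]
    return low, high

def twiddle(k, low, high, s):
    """omega^k from one complex multiplication, splitting k = s*k_hi + k_lo.
    When s is a power of two, k_lo and k_hi are the low- and high-order bits of k."""
    k_hi, k_lo = divmod(k, s)
    return high[k_hi] * low[k_lo]

n, s = 4096, 64                      # s is roughly sqrt(n)
low, high = make_tables(n, s)
worst = max(abs(twiddle(k, low, high, s) - cmath.exp(-2j * math.pi * k / n))
            for k in range(n))
print(len(low) + len(high), "stored entries instead of", n, "with max error", worst)
```

Because every table entry comes straight from library functions and each lookup costs one multiplication, the error per twiddle factor stays bounded rather than accumulating along a recurrence.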


There are a few non-Cooley-Tukey algorithms that are known to have worse error charac- 
teristics, such as the “real-factor” algorithm [313], [114], but these are rarely used in practice 
(and are not used at all in FFTW). On the other hand, some commonly used algorithms for 
type-I and type-IV discrete cosine transforms [372], [290], [73] have errors that we observed 
to grow as $\sqrt{n}$ even for accurate trigonometric constants (although we are not aware of any
theoretical error analysis of these algorithms), and thus we were forced to use alternative 
algorithms [134]. 


To measure the accuracy of FFTW, we compare against a slow FFT implemented in 
arbitrary-precision arithmetic, while to verify the correctness we have found the O (nlogn) 
self-test algorithm of [122] very useful. 


11.8 Concluding Remarks 


It is unlikely that many readers of this chapter will ever have to implement their own fast 
Fourier transform software, except as a learning exercise. The computation of the DFT, 
much like basic linear algebra or integration of ordinary differential equations, is so central to 
numerical computing and so well-established that robust, flexible, highly optimized libraries 
are widely available, for the most part as free/open-source software. And yet there are 
many other problems for which the algorithms are not so finalized, or for which algorithms 
are published but the implementations are unavailable or of poor quality. Whatever new 
problems one comes across, there is a good chance that the chasm between theory and 
efficient implementation will be just as large as it is for FFTs, unless computers become
much simpler in the future. For readers who encounter such a problem, we hope that these
lessons from FFTW will be useful:


• Generality and portability should almost always come first.

• The number of operations, up to a constant factor, is less important than the order of
operations.

• Recursive algorithms with large base cases make optimization easier.

• Optimization, like any tedious task, is best automated.

• Code generation reconciles high-level programming with low-level performance.


$^{13}$ In an FFT, the twiddle factors are powers of $\omega_n$, so $\theta$ is a small angle proportional to $1/n$ and $e^{i\theta}$ is
close to 1.


We should also mention one final lesson that we haven’t discussed in this chapter: you can’t 
optimize in a vacuum, or you end up congratulating yourself for making a slow program 
slightly faster. We started the FFTW project after downloading a dozen FFT implementa- 
tions, benchmarking them on a few machines, and noting how the winners varied between 
machines and between transform sizes. Throughout FFTW’s development, we continued to 
benefit from repeated benchmarks against the dozens of high-quality FFT programs available 
online, without which we would have thought FFTW was “complete” long ago.


Chapter 12


Algorithms for Data with Restrictions


12.1 Algorithms for Real Data 


Many applications involve processing real data. It is inefficient to simply use a complex FFT 
on real data because arithmetic would be performed on the zero imaginary parts of the input, 
and, because of symmetries, output values would be calculated that are redundant. There 
are several approaches to developing special algorithms or to modifying complex algorithms 
for real data. 


There are two methods which use a complex FFT in a special way to increase efficiency [39], 
[359]. The first method uses a length-N complex FFT to compute two length-N real FFTs 
by putting the two real data sequences into the real and the imaginary parts of the input 
to a complex FFT. Because transforms of real data have even real parts and odd imaginary 
parts, it is possible to separate the transforms of the two inputs with 2N-4 extra additions. 
This method requires, however, that two inputs be available at the same time. 
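The first method can be sketched as follows. The direct $O(N^2)$ `dft` below merely stands in for whatever complex FFT routine is available, and the function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT, standing in for any complex FFT routine."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]

def two_real_dfts(a, b):
    """DFTs of two real sequences a and b from one complex DFT of a + i*b."""
    n = len(a)
    z = dft([complex(p, q) for p, q in zip(a, b)])
    fa, fb = [], []
    for k in range(n):
        zc = z[-k % n].conjugate()       # conj(Z[N-k]), index taken modulo N
        fa.append((z[k] + zc) / 2)       # part with even real, odd imaginary symmetry
        fb.append((z[k] - zc) / 2j)      # remaining part, divided by i
    return fa, fb

a = [1.0, 2.0, 3.0, 4.0]                 # hypothetical real inputs
b = [0.0, -1.0, 5.0, 2.0]
fa, fb = two_real_dfts(a, b)
print(fa)
print(fb)
```

The separation step uses only additions and a scaling by one half, which is consistent with the small extra cost quoted above.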


The second method [359] uses the fact that the last stage of a decimation-in-time radix-2 
FFT combines two independent transforms of length N/2 to compute a length-N transform. 
If the data are real, the two half length transforms are calculated by the method described 
above and the last stage is carried out to calculate the total length-N FFT of the real data. 
It should be noted that the half-length FFT does not have to be calculated by a radix-2 
FFT. In fact, it should be calculated by the most efficient complex-data algorithm possible, 
such as the SRFFT or the PFA. The separation of the two half-length transforms and the 
computation of the last stage requires N —6 real multiplications and (5/2) N—6 real additions 
[359]. 
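A sketch of the second method under the same assumptions: a direct `dft` stands in for the most efficient half-length complex algorithm, and `real_dft` is a made-up name. The even- and odd-indexed samples are packed into one length-$N/2$ complex sequence, separated as in the first method, and combined by the last radix-2 decimation-in-time stage.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT, standing in for any complex FFT routine."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]

def real_dft(x):
    """Length-N DFT of real x (N even) from one length-N/2 complex DFT."""
    n = len(x)
    half = n // 2
    # Even samples go in the real part, odd samples in the imaginary part.
    z = dft([complex(x[2 * m], x[2 * m + 1]) for m in range(half)])
    out = []
    for k in range(n):
        j = k % half
        zc = z[-j % half].conjugate()
        even = (z[j] + zc) / 2            # DFT of the even-indexed samples
        odd = (z[j] - zc) / 2j            # DFT of the odd-indexed samples
        out.append(even + cmath.exp(-2j * math.pi * k / n) * odd)  # last DIT stage
    return out

x = [1.0, 0.5, -2.0, 3.0, 0.0, 4.0, -1.0, 2.5]   # hypothetical real data
X = real_dft(x)
print(X)
```

For clarity the loop computes all $N$ outputs; a practical version would compute only $k = 0, \ldots, N/2$ and obtain the rest from Hermitian symmetry.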


It is possible to derive more efficient real-data algorithms directly rather than using a complex 
FFT. The basic idea is from Bergland [21], [22] and Sande [325] which, at each stage, uses 
the symmetries of a constant radix Cooley-Tukey FFT to minimize arithmetic and storage. 
In the usual derivation [275] of the radix-2 FFT, the length-N transform is written as the
combination of the length-N/2 DFT of the even indexed data and the length-N/2 DFT 
of the odd indexed data. If the input to each half-length DFT is real, the output will 
have Hermitian symmetry. Hence the output of each stage can be arranged so that the 
results of that stage store the complex DFT with the real part located where half of the
DFT would have gone, and the imaginary part located where the conjugate would have 
gone. This removes most of the redundant calculations and storage but slightly complicates 
the addressing. The resulting butterfly structure for this algorithm [359] resembles that 
for the fast Hartley transform [353]. The complete algorithm has one half the number of 
multiplications and N-2 fewer than half the additions of the basic complex FFT. Applying 
this approach to the split-radix FFT gives a particularly interesting algorithm [103], [359], 
[111].


Special versions of both the PFA and WFTA can also be developed for real data. Because the
operations in the stages of the PFA can be commuted, it is possible to move the combination 
of the transform of the real part of the input and imaginary part to the last stage. Because 
the imaginary part of the input is zero, half of the algorithm is simply omitted. This results 
in the number of multiplications required for the real transform being exactly half of that 
required for complex data and the number of additions being about N less than half that 
required for the complex case because adding a pure real number to a pure imaginary number 
does not require an actual addition. Unfortunately, the indexing and data transfer becomes 
somewhat more complicated [179], [359]. A similar approach can be taken with the WFTA 
[179], [359], [284]. 


12.2 Special Algorithms for input Data that is mostly Zero, 
for Calculating only a few Outputs, or where the Sampling 
is not Uniform 


In some cases, most of the data to be transformed are zero. It is clearly wasteful to do 
arithmetic on that zero data. Another special case is when only a few DFT values are 
needed. It is likewise wasteful to calculate outputs that are not needed. We use a process 
called “pruning" to remove the unneeded operations. 


In other cases, the data are non-uniform samples of a continuous-time signal [13].


12.3 Algorithms for Approximate DFTs 


There are applications where approximations to the DFT are all that is needed [161], [163].


Chapter 13


Convolution Algorithms


13.1 Fast Convolution by the FFT 


One of the main applications of the FFT is to do convolution more efficiently than the direct 
calculation from the definition which is: 


$$y(n) = \sum_{m} h(m)\, x(n-m) \tag{13.1}$$
which, with a change of variables, can also be written as:
$$y(n) = \sum_{m} x(m)\, h(n-m) \tag{13.2}$$


This is often used to filter a signal x (n) with a filter whose impulse response is h(n). Each 
output value y(n) requires N multiplications and N — 1 additions if y(n) and h(n) have N 
terms. So, for N output values, on the order of N? arithmetic operations are required. 


Because the DFT converts convolution to multiplication,


$$\mathrm{DFT}\{y(n)\} = \mathrm{DFT}\{h(n)\}\; \mathrm{DFT}\{x(n)\} \tag{13.3}$$


the convolution can be calculated with the FFT, which brings the order of arithmetic operations down to
$N\log(N)$; this can be a significant saving for large N.


This approach, which is called “fast convolution”, is a form of block processing since a whole
block or segment of x(n) must be available to calculate even one output value, y(n). So,
a time delay of one block length is always required. Another problem is that the filtering use of
convolution is usually non-cyclic, whereas the convolution implemented with the DFT is cyclic.
This is dealt with by appending zeros to x(n) and h(n) such that the output of the cyclic
convolution gives one block of the output of the desired non-cyclic convolution.
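The zero-padding step can be made concrete with a small sketch. The recursive radix-2 `fft` below is a plain textbook version included only to keep the example self-contained; rounding the transform length up to a power of two is an assumption made for that routine, not a requirement of the method.

```python
import cmath
import math

def fft(x, sign=-1):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], sign), fft(x[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def ifft(x):
    """Inverse transform: conjugate kernel and 1/N scaling."""
    return [v / len(x) for v in fft(x, sign=+1)]

def fast_convolve(x, h):
    """Non-cyclic convolution of real x and h via a zero-padded cyclic convolution."""
    out_len = len(x) + len(h) - 1
    n = 1
    while n < out_len:                  # pad far enough that nothing wraps around
        n *= 2
    xf = fft(list(x) + [0.0] * (n - len(x)))
    hf = fft(list(h) + [0.0] * (n - len(h)))
    y = ifft([a * b for a, b in zip(xf, hf)])
    return [v.real for v in y[:out_len]]

def direct_convolve(x, h):
    """Convolution straight from the definition, for comparison."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0]           # hypothetical signal
h = [1.0, -1.0, 2.0]                    # hypothetical impulse response
print(fast_convolve(x, h))
print(direct_convolve(x, h))
```

Without the padding, the tail of the result would wrap around and add onto the first output samples, which is the cyclic aliasing described above.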


For filtering and some other applications, one wants “on-going” convolution where the filter
response h(n) may be finite in length or duration, but the input x(n) is of arbitrary length.
Two methods have traditionally been used to break the input into blocks and use the FFT to
convolve each block so that the output that would have been calculated by directly implementing
(13.1) or (13.2) can be constructed efficiently. These are called “overlap-add” and “overlap-save”.


13.1.1 Fast Convolution by Overlap-Add 


In order to use the FFT to convolve (or filter) a long input sequence x(n) with a finite 
length-M impulse response, h(n), we partition the input sequence in segments or blocks of 
length L. Because convolution (or filtering) is linear, the output is a linear sum of the result 
of convolving the first block with h(n) plus the result of convolving the second block with 
h(n), plus the rest. Each of these block convolutions can be calculated by using the FFT. 
The output is the inverse FFT of the product of the FFT of x(n) and the FFT of h(n).
Since the number of arithmetic operations to calculate the convolution directly is on the order
of $M^2$ and, if done with the FFT, is on the order of $M\log(M)$, there can be a great savings
by using the FFT for large M.


This procedure is not totally straightforward because the output of
convolving a length-L block with a length-M filter is of length $L + M - 1$. This means the
output blocks cannot simply be concatenated but must be overlapped and added, hence the
name “overlap-add”.


The second issue that must be taken into account is the fact that the overlap-add steps 
need non-cyclic convolution and convolution by the FFT is cyclic. This is easily handled by 
appending L — 1 zeros to the impulse response and M — 1 zeros to each input block so that 
all FFTs are of length $M + L - 1$. This means there is no aliasing and the implemented
cyclic convolution gives the same output as the desired non-cyclic convolution. 
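The overlap-add procedure described above can be sketched as follows. The radix-2 `fft` is a simple stand-in for a production FFT, so this sketch assumes that $L + M - 1$ is a power of two; the function names and test data are hypothetical.

```python
import cmath
import math

def fft(x, sign=-1):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], sign), fft(x[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def ifft(x):
    return [v / len(x) for v in fft(x, sign=+1)]

def overlap_add(x, h, L):
    """Filter x with the length-M response h using input blocks of length L."""
    M = len(h)
    n = L + M - 1                                 # FFT length, as in the text
    if n & (n - 1):
        raise ValueError("this sketch needs L + M - 1 to be a power of two")
    hf = fft(list(h) + [0.0] * (L - 1))           # filter transform, computed once
    y = [0.0] * (len(x) + M - 1)
    for start in range(0, len(x), L):
        block = list(x[start:start + L])
        bf = fft(block + [0.0] * (n - len(block)))  # append zeros to the block
        seg = ifft([a * b for a, b in zip(bf, hf)])
        for i in range(len(block) + M - 1):       # tail overlaps the next block
            y[start + i] += seg[i].real
    return y

x = [float(i % 7 - 3) for i in range(50)]         # hypothetical input signal
h = [1.0, 2.0, -1.0, 0.5]                         # hypothetical length-4 filter
y = overlap_add(x, h, L=13)                       # 13 + 4 - 1 = 16
print(y[:8])
```

Each output segment is $M - 1$ samples longer than its input block, and the inner accumulation loop is where those extra samples are added onto the start of the next segment.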


The savings in arithmetic can be considerable when implementing convolution or performing 
FIR digital filtering. However, there are two penalties. The use of blocks introduces a delay 
of one block length. None of the first block of output can be calculated until all of the first 
block of input is available. This is not a problem for “off line" or “batch" processing but can 
be serious for real-time processing. The second penalty is the memory required to store and 
process the blocks. The continuing reduction of memory cost often removes this problem. 


The efficiency in terms of number of arithmetic operations per output point increases for
large blocks because of the $M\log(M)$ requirements of the FFT. However, if the blocks become
very large ($L \gg M$), much of the zero-padded impulse response will be the appended zeros and efficiency
is lost. For any particular application, taking the particular filter and FFT algorithm being
used and the particular hardware being used, a plot of efficiency vs. block length L should
be made and L chosen to maximize efficiency given any other constraints that are applicable.


Usually, the block convolutions are done by the FFT, but they could be done by any
efficient, finite-length method. One could use “rectangular transforms” or “number-theoretic
transforms”. A generalization of this method is presented later in the notes.


13.1.2 Fast Convolution by Overlap-Save 


An alternative to overlap-add can be developed by starting with segmenting
the output rather than the input. If one considers the calculation of a block of output, it
is seen that not only the corresponding input block is needed, but part of the preceding
input block is also needed. Indeed, one can show that a length $M + L - 1$ segment of the
input is needed for each output block. So, one saves the last part of the preceding block and
concatenates it with the current input block, then convolves that with h(n) to calculate the
current output block.
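A sketch of overlap-save under the same assumptions as before: a textbook radix-2 `fft` as a stand-in, $L + M - 1$ a power of two, and hypothetical names and data. Each length-$(M + L - 1)$ input segment is cyclically convolved with $h(n)$ and the first $M - 1$ outputs, which are corrupted by wrap-around, are discarded.

```python
import cmath
import math

def fft(x, sign=-1):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], sign), fft(x[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def ifft(x):
    return [v / len(x) for v in fft(x, sign=+1)]

def overlap_save(x, h, L):
    """Filter x with the length-M response h, producing L outputs per block."""
    M = len(h)
    n = L + M - 1                                  # input segment length per block
    if n & (n - 1):
        raise ValueError("this sketch needs L + M - 1 to be a power of two")
    hf = fft(list(h) + [0.0] * (L - 1))
    total = len(x) + M - 1
    # Leading zeros act as the "saved" samples before the first block.
    padded = [0.0] * (M - 1) + list(x) + [0.0] * n
    y = []
    for start in range(0, total, L):
        seg = padded[start:start + n]              # M-1 saved samples, then L new ones
        out = ifft([a * b for a, b in zip(fft(seg), hf)])
        y.extend(v.real for v in out[M - 1:])      # drop the M-1 aliased outputs
    return y[:total]

x = [float(i % 7 - 3) for i in range(50)]          # hypothetical input signal
h = [1.0, 2.0, -1.0, 0.5]                          # hypothetical length-4 filter
y = overlap_save(x, h, L=13)                       # 13 + 4 - 1 = 16
print(y[:8])
```

In contrast to overlap-add, no additions are needed to merge blocks; the overlap is in the input segments, and the valid outputs are simply concatenated.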


13.2 Block Processing, a Generalization of Overlap Methods


Convolution is intimately related to the DFT. It was shown in The DFT as Convolution or 
Filtering (Chapter 5) that a prime length DFT could be converted to cyclic convolution. It 
has been long known [276] that convolution can be calculated by multiplying the DFTs of 
signals. 


An important question is what is the fastest method for calculating digital convolution. 
There are several methods that each have some advantage. The earliest method for fast 
convolution was the use of sectioning with overlap-add or overlap-save and the FFT [276], 
[300], [66]. In most cases the convolution is of real data and, therefore, real-data FFTs 
should be used. That approach is still probably the fastest method for longer convolution 
on a general purpose computer or microprocessor. The shorter convolutions should simply 
be calculated directly. 


13.3 Introduction 


The partitioning of long or infinite strings of data into shorter sections or blocks has been 
used to allow application of the FFT to realize on-going or continuous convolution [368], 
[181]. This section develops the idea of block processing and shows that it is a generalization 
of the overlap-add and overlap-save methods [368], [147]. The idea is then further generalized to
a multidimensional formulation of convolution [3], [47]. Moving in the opposite direction, it
is shown that, rather than partitioning a string of scalars into blocks and then into blocks of
blocks, one can partition a scalar number into blocks of bits and then include the operation
of multiplication in the signal processing formulation. This is called distributed arithmetic
[45] and, since it describes operations at the bit level, is completely general. These notes try
to present a coherent development of these ideas.


13.4 Block Signal Processing


In this section the usual convolution and recursion that implements FIR and IIR discrete- 
time filters are reformulated in terms of vectors and matrices. Because the same data is 
partitioned and grouped in a variety of ways, it is important to have a consistent notation 
in order to be clear. The $n$th element of a data sequence is expressed $h(n)$ or, in some cases
to simplify, $h_n$. A block or finite length column vector is denoted $\underline{h}_n$, with $n$ indicating the
$n$th block or section of a longer vector. A matrix, square or rectangular, is indicated by an
upper case letter such as $H$ with a subscript if appropriate.


13.4.1 Block Convolution 


The operation of a finite impulse response (FIR) filter is described by a finite convolution as


$$y(n) = \sum_{k=0}^{L-1} h(k)\, x(n-k) \tag{13.4}$$


where x(n) is causal, h(n) is causal and of length L, and the time index n goes from zero
to infinity or some large value. With a change of index variables this becomes


$$y(n) = \sum_{k=0}^{n} h(n-k)\, x(k) \tag{13.5}$$


which can be expressed as a matrix operation by 


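$$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \end{bmatrix} = \begin{bmatrix} h_0 & 0 & 0 & \cdots & 0 \\ h_1 & h_0 & 0 & & \\ h_2 & h_1 & h_0 & & \\ \vdots & & & \ddots & \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \end{bmatrix} \tag{13.6}$$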


The H matrix of impulse response values is partitioned into N by N square sub matrices 
and the X and Y vectors are partitioned into length-N blocks or sections. This is illustrated 
for N = 3 by 


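$$H_0 = \begin{bmatrix} h_0 & 0 & 0 \\ h_1 & h_0 & 0 \\ h_2 & h_1 & h_0 \end{bmatrix} \qquad H_1 = \begin{bmatrix} h_3 & h_2 & h_1 \\ h_4 & h_3 & h_2 \\ h_5 & h_4 & h_3 \end{bmatrix} \qquad \text{etc.} \tag{13.7}$$

$$\underline{x}_0 = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \qquad \underline{x}_1 = \begin{bmatrix} x_3 \\ x_4 \\ x_5 \end{bmatrix} \qquad \underline{y}_0 = \begin{bmatrix} y_0 \\ y_1 \\ y_2 \end{bmatrix} \qquad \text{etc.} \tag{13.8}$$

Substituting these definitions into (13.6) gives


$$\begin{bmatrix} \underline{y}_0 \\ \underline{y}_1 \\ \underline{y}_2 \\ \vdots \end{bmatrix} = \begin{bmatrix} H_0 & 0 & 0 & \cdots & 0 \\ H_1 & H_0 & 0 & & \\ H_2 & H_1 & H_0 & & \\ \vdots & & & \ddots & \end{bmatrix} \begin{bmatrix} \underline{x}_0 \\ \underline{x}_1 \\ \underline{x}_2 \\ \vdots \end{bmatrix} \tag{13.9}$$

The general expression for the $n$th output block is

$$\underline{y}_n = \sum_{k=0}^{n} H_{n-k}\, \underline{x}_k \tag{13.10}$$


which is a vector or block convolution. Since the matrix-vector multiplication within the
block convolution is itself a convolution, (13.10) is a sort of convolution of convolutions and
the finite length matrix-vector multiplication can be carried out using the FFT or other fast
convolution methods.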


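The equation for one output block can be written as the product


$$\underline{y}_2 = \begin{bmatrix} H_2 & H_1 & H_0 \end{bmatrix} \begin{bmatrix} \underline{x}_0 \\ \underline{x}_1 \\ \underline{x}_2 \end{bmatrix} \tag{13.11}$$

and the effects of one input block can be written

$$\begin{bmatrix} H_0 \\ H_1 \\ H_2 \end{bmatrix} \underline{x}_0 = \begin{bmatrix} \underline{y}_0 \\ \underline{y}_1 \\ \underline{y}_2 \end{bmatrix} \tag{13.12}$$

where the right-hand side holds the contributions of $\underline{x}_0$ to the first three output blocks.


These are generalized statements of overlap-save and overlap-add [368], [147]. The block
length can be longer, shorter, or equal to the filter length.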
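The block convolution (13.10) can be checked numerically against the scalar definition. The sketch below builds each $N \times N$ block $H_k$ explicitly and uses plain matrix-vector products; in practice these products are where the FFT would be applied. All function names and the test data are made up for illustration.

```python
def block_matrix(h, k, N):
    """N-by-N block H_k of the convolution matrix: entry (i, j) is h[k*N + i - j]."""
    def tap(m):
        return h[m] if 0 <= m < len(h) else 0.0
    return [[tap(k * N + i - j) for j in range(N)] for i in range(N)]

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def block_convolve(x, h, N):
    """y_n = sum_{k=0}^{n} H_{n-k} x_k over length-N blocks, as in (13.10)."""
    out_len = len(x) + len(h) - 1
    n_blocks = -(-out_len // N)                       # ceiling division
    xp = list(x) + [0.0] * (n_blocks * N - len(x))    # zero-pad the input
    xb = [xp[i * N:(i + 1) * N] for i in range(n_blocks)]
    H = {}                                            # cache of the blocks H_k
    y = []
    for n in range(n_blocks):
        acc = [0.0] * N
        for k in range(n + 1):
            if n - k not in H:
                H[n - k] = block_matrix(h, n - k, N)
            acc = [p + q for p, q in zip(acc, matvec(H[n - k], xb[k]))]
        y.extend(acc)
    return y[:out_len]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]               # hypothetical input
h = [1.0, -2.0, 3.0, 0.5, 2.0]                        # hypothetical impulse response
y = block_convolve(x, h, N=3)
print(y)
```

Here the filter is longer than the block, so three of the blocks $H_k$ are nonzero; the same code runs unchanged when the block length is longer than or equal to the filter length.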


13.4.2 Block Recursion 


Although less well-known, IIR filters can also be implemented with block processing [145], 
[74], [396], [43], [44]. The block form of an IIR filter is developed in much the same way as 
for the block convolution implementation of the FIR filter. The general constant coefficient 
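difference equation which describes an IIR filter with recursive coefficients $a_l$, convolution
coefficients $b_k$, input signal x(n), and output signal y(n) is given by


$$y(n) = -\sum_{l=1}^{N-1} a_l\, y_{n-l} + \sum_{k=0}^{M-1} b_k\, x_{n-k} \tag{13.13}$$

using both functional notation and subscripts, depending on which is easier and clearer.
The impulse response h(n) is


$$h(n) = -\sum_{l=1}^{N-1} a_l\, h(n-l) + \sum_{k=0}^{M-1} b_k\, \delta(n-k) \tag{13.14}$$

which can be written in matrix operator form

$$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ a_1 & 1 & 0 & & \\ a_2 & a_1 & 1 & & \\ a_3 & a_2 & a_1 & & \\ 0 & a_3 & a_2 & & \\ \vdots & & & & \vdots \end{bmatrix} \begin{bmatrix} h_0 \\ h_1 \\ h_2 \\ h_3 \\ h_4 \\ \vdots \end{bmatrix} = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ 0 \\ \vdots \end{bmatrix} \tag{13.15}$$

In terms of N by N submatrices and length-N blocks, this becomes

$$\begin{bmatrix} A_0 & 0 & 0 & \cdots & 0 \\ A_1 & A_0 & 0 & & \\ 0 & A_1 & A_0 & & \\ \vdots & & & \ddots & \end{bmatrix} \begin{bmatrix} \underline{h}_0 \\ \underline{h}_1 \\ \underline{h}_2 \\ \vdots \end{bmatrix} = \begin{bmatrix} \underline{b}_0 \\ \underline{b}_1 \\ 0 \\ \vdots \end{bmatrix} \tag{13.16}$$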
From this formulation, a block recursive equation can be written that will generate the
impulse response block by block.


$$A_0\, \underline{h}_n + A_1\, \underline{h}_{n-1} = 0 \quad \text{for } n \ge 2 \tag{13.17}$$


$$\underline{h}_n = -A_0^{-1} A_1\, \underline{h}_{n-1} = K\, \underline{h}_{n-1} \quad \text{for } n \ge 2 \tag{13.18}$$


with initial conditions given by


$$\underline{h}_1 = -A_0^{-1} A_1 A_0^{-1}\, \underline{b}_0 + A_0^{-1}\, \underline{b}_1 \tag{13.19}$$

This can also be written to generate the square partitions of the impulse response matrix
by

$$H_n = K H_{n-1} \quad \text{for } n \ge 2 \tag{13.20}$$


with initial conditions given by


$$H_1 = K A_0^{-1} B_0 + A_0^{-1} B_1 \tag{13.21}$$

and $K = -A_0^{-1} A_1$. This recursively generates square submatrices of H similar to those
defined in (13.7) and (13.9) and shows the basic structure of the dynamic system.


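Next, we develop the recursive formulation for a general input as described by the scalar
difference equation (13.13) and in matrix operator form by


$$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ a_1 & 1 & 0 & & \\ a_2 & a_1 & 1 & & \\ a_3 & a_2 & a_1 & & \\ 0 & a_3 & a_2 & & \\ \vdots & & & & \vdots \end{bmatrix} \begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ \vdots \end{bmatrix} = \begin{bmatrix} b_0 & 0 & 0 & \cdots & 0 \\ b_1 & b_0 & 0 & & \\ b_2 & b_1 & b_0 & & \\ 0 & b_2 & b_1 & & \\ 0 & 0 & b_2 & & \\ \vdots & & & & \vdots \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \end{bmatrix} \tag{13.22}$$


which, after substituting the definitions of the submatrices and assuming the block length
is larger than the order of the numerator or denominator, becomes


$$\begin{bmatrix} A_0 & 0 & 0 & \cdots & 0 \\ A_1 & A_0 & 0 & & \\ 0 & A_1 & A_0 & & \\ \vdots & & & \ddots & \end{bmatrix} \begin{bmatrix} \underline{y}_0 \\ \underline{y}_1 \\ \underline{y}_2 \\ \vdots \end{bmatrix} = \begin{bmatrix} B_0 & 0 & 0 & \cdots & 0 \\ B_1 & B_0 & 0 & & \\ 0 & B_1 & B_0 & & \\ \vdots & & & \ddots & \end{bmatrix} \begin{bmatrix} \underline{x}_0 \\ \underline{x}_1 \\ \underline{x}_2 \\ \vdots \end{bmatrix} \tag{13.23}$$

From the partitioned rows of (13.23), one can write the block recursive relation

$$A_0\, \underline{y}_{n+1} + A_1\, \underline{y}_n = B_0\, \underline{x}_{n+1} + B_1\, \underline{x}_n \tag{13.24}$$

Solving for $\underline{y}_{n+1}$ gives

$$\underline{y}_{n+1} = -A_0^{-1} A_1\, \underline{y}_n + A_0^{-1} B_0\, \underline{x}_{n+1} + A_0^{-1} B_1\, \underline{x}_n \tag{13.25}$$

$$\underline{y}_{n+1} = K\, \underline{y}_n + H_0\, \underline{x}_{n+1} + \tilde{H}_1\, \underline{x}_n \tag{13.26}$$


with $K = -A_0^{-1} A_1$, $H_0 = A_0^{-1} B_0$, and $\tilde{H}_1 = A_0^{-1} B_1$, which is a first order vector difference equation [43], [44]. This is the fundamental block
recursive algorithm that implements the original scalar difference equation in (13.13). It has
several important characteristics.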


• The block recursive formulation is similar to a state variable equation but the states
are blocks or sections of the output [44], [220], [427], [428].

• The eigenvalues of K are the poles of the original scalar problem raised to the Nth power
plus others that are zero. The longer the block length, the “more stable” the filter is,
i.e., the further the poles are from the unit circle [43], [44], [427], [15], [16].

• If the block length were shorter than the denominator, the vector difference equation
would be higher than first order. There would be a nonzero $A_2$. If the block length
were shorter than the numerator, there would be a nonzero $B_2$ and a higher order
block convolution operation. If the block length were one, the order of the vector
equation would be the same as the scalar equation. They would be the same equation.

• The actual arithmetic that goes into the calculation of the output is partly recursive
and partly convolution. The longer the block, the more the output is calculated by
convolution and the more arithmetic is required.

• It is possible to remove the zero eigenvalues in K by making K rectangular or square
and N by N. This results in a form even more similar to a state variable formulation
[240], [44]. This is briefly discussed below in Section 13.4.3.

• There are several ways of using the FFT in the calculation of the various matrix
products in (13.25) and in (13.27) and (13.28). Each has some arithmetic advantage
for various forms and orders of the original equation. It is also possible to implement
some of the operations using rectangular transforms, number theoretic transforms,
distributed arithmetic, or other efficient convolution algorithms [44], [427], [54], [48],
[426], [286].

• By choosing the block length equal to the period, a periodically time varying filter can
be made block time invariant. In other words, all the time varying characteristics are
moved to the finite matrix multiplies which leave the time invariant properties at the
block level. This allows the z-transform and other time-invariant methods to be
used for stability analysis and frequency response analysis [244], [245]. It also turns
out to be related to filter banks and multi-rate filters [222], [221], [97].
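The block recursion can be checked against the scalar difference equation with a short sketch. It assumes the sign convention of the matrix form (13.22), that is, $\sum_l a_l\, y(n-l) = \sum_k b_k\, x(n-k)$ with $a_0 = 1$, and a block length at least as large as the filter orders so that only $A_0$, $A_1$, $B_0$, and $B_1$ are nonzero. Rather than forming $A_0^{-1}$ explicitly, it solves (13.24) for each new block by forward substitution. All names and coefficients are hypothetical.

```python
def toeplitz_blocks(c, N):
    """Blocks T0 (lower triangular) and T1 of the banded Toeplitz matrix built
    from coefficients c; needs len(c) <= N + 1 so later blocks are zero."""
    def tap(m):
        return c[m] if 0 <= m < len(c) else 0.0
    T0 = [[tap(i - j) for j in range(N)] for i in range(N)]
    T1 = [[tap(N + i - j) for j in range(N)] for i in range(N)]
    return T0, T1

def matvec(A, v):
    return [sum(p * q for p, q in zip(row, v)) for row in A]

def solve_unit_lower(A, r):
    """Solve A v = r for unit lower-triangular A by forward substitution."""
    v = []
    for i, row in enumerate(A):
        v.append(r[i] - sum(row[j] * v[j] for j in range(i)))
    return v

def block_iir(a, b, x, N):
    """Block recursion A0 y[n+1] = -A1 y[n] + B0 x[n+1] + B1 x[n], with a[0] == 1."""
    A0, A1 = toeplitz_blocks(a, N)
    B0, B1 = toeplitz_blocks(b, N)
    n_blocks = -(-len(x) // N)
    xp = list(x) + [0.0] * (n_blocks * N - len(x))
    y_prev, x_prev, y = [0.0] * N, [0.0] * N, []
    for n in range(n_blocks):
        x_cur = xp[n * N:(n + 1) * N]
        rhs = [p + q - r for p, q, r in
               zip(matvec(B0, x_cur), matvec(B1, x_prev), matvec(A1, y_prev))]
        y_cur = solve_unit_lower(A0, rhs)
        y.extend(y_cur)
        y_prev, x_prev = y_cur, x_cur
    return y[:len(x)]

def scalar_iir(a, b, x):
    """Reference: y(n) = -sum_l a[l] y(n-l) + sum_k b[k] x(n-k), with a[0] == 1."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[l] * y[n - l] for l in range(1, len(a)) if n - l >= 0)
        y.append(acc)
    return y

a = [1.0, -0.5, 0.25, 0.1]          # hypothetical denominator, a[0] = 1
b = [1.0, 2.0, 1.0]                 # hypothetical numerator
x = [1.0, 0.0, -1.0, 2.0, 0.5, 0.0, 3.0, -2.0, 1.0, 1.0]
y_block = block_iir(a, b, x, N=4)
y_scalar = scalar_iir(a, b, x)
print(y_block)
```

Forward substitution is used here only to keep the sketch short; the matrix products are where FFT-based or other fast convolution methods would enter, as noted in the list above.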


13.4.3 Block State Formulation 


It is possible to reduce the size of the matrix operators in the block recursive description
(13.26) to give a form even more like a state variable equation [240], [44], [428]. If K in
(13.26) has several zero eigenvalues, it should be possible to reduce the size of K until it has
full rank. That was done in [44] and the result is


$$\underline{z}_n = K_1\, \underline{z}_{n-1} + K_2\, \underline{x}_n \tag{13.27}$$


$$\underline{y}_n = H_1\, \underline{z}_{n-1} + H_0\, \underline{x}_n \tag{13.28}$$


where $H_0$ is the same N by N convolution matrix, $H_1$ is a rectangular L by N partition of
the convolution matrix H, $K_1$ is a square N by N matrix of full rank, and $K_2$ is a rectangular
N by L matrix.


This is now a minimal state equation whose input and output are blocks of the original input
and output. Some of the matrix multiplications can be carried out using the FFT or other
techniques.


13.4.4 Block Implementations of Digital Filters


The advantage of the block convolution and recursion implementations is a possible improve- 
ment in arithmetic efficiency by using the FFT or other fast convolution methods for some of 
the multiplications in (13.10) or (13.25) [246], [247]. There is also a reduction of quantization
effects due to an effective decrease in the magnitude of the eigenvalues and the possibility of
easier parallel implementation for IIR filters. The disadvantages are a delay of at least one
block length and an increased memory requirement.


These methods could also be used in the various filtering methods for evaluating the DFT.
These include the chirp z-transform, Rader’s method, and Goertzel’s algorithm.


13.4.5 Multidimensional Formulation 


This process of partitioning the data vectors and the operator matrices can be continued by 
partitioning (13.10) and (13.24) and creating blocks of blocks to give a higher dimensional 
structure. One should use index mapping ideas rather than partitioned matrices for this 
approach [3], [47]. 


13.4.6 Periodically Time-Varying Discrete-Time Systems 


Most time-varying systems are periodically time-varying and this allows special results to be 
obtained. If the block length is set equal to the period of the time variations, the resulting 
block equations are time invariant and all of the time-varying characteristics are contained
in the matrix multiplications. This allows some of the tools of time invariant systems to be 
used on periodically time-varying systems. 


The PTV system is analyzed in [425], [97], [81], [244], the filter analysis and design problem, 
which includes the decimation—interpolation structure, is addressed in [126], [245], [222], and 
the bandwidth compression problem in [221]. These structures can take the form of filter 
banks [387]. 


13.4.7 Multirate Filters, Filter Banks, and Wavelets 


Another area that is related to periodically time varying systems and to block processing 
is filter banks [387], [152]. Recently the area of perfect reconstruction filter banks has been 
further developed and shown to be closely related to wavelet based signal analysis [97], [99], 
[151], [387]. The filter bank structure has several forms with the polyphase and lattice being 
particularly interesting. 


An idea that has some elements of multirate filters, perfect reconstruction, and distributed
arithmetic is given in [142], [140], [141]. Parks has noted that design of multirate filters has
some elements in common with complex approximation and with 2-D filter design [337], [338]
and is looking at using Tang’s method for these designs.


13.4.8 Distributed Arithmetic


Rather than grouping the individual scalar data values in a discrete-time signal into blocks, 
the scalar values can be partitioned into groups of bits. Because multiplication of integers, 
multiplication of polynomials, and discrete-time convolution are the same operations, the bit- 
level description of multiplication can be mixed with the convolution of the signal processing. 
The resulting structure is called distributed arithmetic [45], [402]. It can be used to create 
an efficient table look-up scheme to implement an FIR or IIR filter using no multiplications
by fetching previously calculated partial products which are stored in a table. Distributed 
arithmetic, block processing, and multi-dimensional formulations can be combined into an 
integrated powerful description to implement digital filters and processors. There may be a 
new form of distributed arithmetic using the ideas in [140], [141]. 


13.5 Direct Fast Convolution and Rectangular Transforms 


A relatively new approach uses index mapping directly to convert a one dimensional con- 
volution into a multidimensional convolution [47], [8]. This can be done by either a type-1 
or type-2 map. The short convolutions along each dimension are then done by Winograd’s 
optimal algorithms. Unlike for the case of the DFT, there is no savings of arithmetic from 
the index mapping alone. All the savings comes from efficient short algorithms. In the case 
of index mapping with convolution, the multiplications must be nested together in the cen- 
ter of the algorithm in the same way as for the WFTA. There is no equivalent to the PFA 
structure for convolution. The multidimensional convolution can not be calculated by row 
and column convolutions as the DFT was by row and column DFTs. 


It would first seem that applying the index mapping and optimal short algorithms directly 
to convolution would be more efficient than using DFTs and converting them to convolution 
to be calculated by the same optimal algorithms. In practical algorithms, however, the DFT 
method seems to be more efficient [286]. 


A method that is attractive for special purpose hardware uses distributed arithmetic [45]. 
This approach uses a table look up of precomputed partial products to produce a system 
that does convolution without requiring multiplications [79]. 


Another method that requires special hardware uses number theoretic transforms [31], [237], 
[265] to calculate convolution. These transforms are defined over finite fields or rings with 
arithmetic performed modulo special numbers. These transforms have rather limited
flexibility, but when they can be used, they are very efficient.


13.6 Number Theoretic Transforms for Convolution


13.6.1 Results from Number Theory 


A basic review of the number theory useful for signal processing algorithms will be given 
here with specific emphasis on the congruence theory for number theoretic transforms [279], 
[165], [260], [237], [328]. 


13.6.2 Number Theoretic Transforms 


Here we look at the conditions placed on a general linear transform in order for it to support
cyclic convolution. The form of a linear transformation of a length-N sequence of numbers is
given by


$$X(k) = \sum_{n=0}^{N-1} t(n,k)\, x(n) \tag{13.29}$$

for $k = 0, 1, \cdots, (N-1)$. The definition of cyclic convolution of two sequences is given by

$$y(n) = \sum_{m=0}^{N-1} x(m)\, h(n-m) \tag{13.30}$$


for $n = 0, 1, \cdots, (N-1)$ and all indices evaluated modulo N. We would like to find the
properties of the transformation such that it will support the cyclic convolution. This means
that if X(k), H(k), and Y(k) are the transforms of x(n), h(n), and y(n) respectively,


$$Y(k) = X(k)\, H(k). \tag{13.31}$$

The conditions are derived by taking the transform defined in (13.29) of both sides of equation
(13.30) which gives


N-1 —1 


Y (k) = Sot(n,k) Sox (m) h(n-m) (13.32) 


n=0 =0 


N-1N-1 


= S°Sox(m) h(n—m) t(n,k). (13.33) 


m=0n=0 


Making the change of index variables, 1 = n — m, gives 


N-1N-1 


=S° So a(m) hd) t(l+m,k). (13.34) 


m=0 l=0 
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But from (13.6), this must be 


N-1 N-1 
Y (k) = Soa(n) t(n,k) Sox (m) t(m,k) (13.35) 
n=0 m=0 
N-1N-1 
=S5Soa(m) h(i) t(n,k) t(1,k). (13.36) 
m=0 1=0 
This must be true for all x(n), h(n), and k, therefore from (13.9) and (13.11) we have 
t(m+l1,k) =t(m,k) t(l,k) (13.37) 
For | = 0 we have 
t(m,k) =t(m,k) t(0,k) (13.38) 


and, therefore, $t(0,k) = 1$. For $l = m$ we have 

$$t(2m,k) = t(m,k)\, t(m,k) = t^2(m,k). \tag{13.39}$$

Applying (13.37) repeatedly, for any positive integer p we likewise have 

$$t(pm,k) = t^p(m,k) \tag{13.40}$$

and, therefore, because the indices are evaluated modulo N, 

$$t^N(m,k) = t(Nm,k) = t(0,k) = 1. \tag{13.41}$$

But 

$$t(m,k) = t^m(1,k) = t^k(m,1), \tag{13.42}$$

therefore, 

$$t(m,k) = t^{mk}(1,1). \tag{13.43}$$

Defining $t(1,1) = \alpha$ gives the form for our general linear transform (13.29) as 

$$X(k) = \sum_{n=0}^{N-1} \alpha^{nk}\, x(n) \tag{13.44}$$

where $\alpha$ is a root of order N, which means that N is the smallest positive integer such that $\alpha^N = 1$. 


Theorem 1 The transform (13.44) supports cyclic convolution if and only if $\alpha$ is a root of 
order N and $N^{-1}$ is defined. 


This is discussed in [2], [4]. 
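The two conditions in Theorem 1 can be checked on a small case. The following is only a minimal sketch under assumed parameters that do not appear in the text: the modulus M = 17, the length N = 8, and the root 2 (which has order 8 modulo 17) are chosen for illustration, and the function names `ntt`, `intt`, and `cyclic_convolution` are hypothetical. The transform is evaluated directly from its definition rather than by a fast algorithm.

```python
def ntt(x, alpha, M):
    """X(k) = sum_n alpha^(n k) x(n), with all arithmetic modulo M."""
    N = len(x)
    return [sum(pow(alpha, n * k, M) * x[n] for n in range(N)) % M
            for k in range(N)]


def intt(X, alpha, M):
    """Inverse transform; it needs both N^(-1) and alpha^(-1) modulo M."""
    N = len(X)
    n_inv = pow(N, -1, M)          # modular inverse (Python 3.8 or later)
    alpha_inv = pow(alpha, -1, M)
    return [(n_inv * sum(pow(alpha_inv, n * k, M) * X[k] for k in range(N))) % M
            for n in range(N)]


def cyclic_convolution(x, h, M):
    """Direct cyclic convolution with indices modulo N and values modulo M."""
    N = len(x)
    return [sum(x[m] * h[(n - m) % N] for m in range(N)) % M for n in range(N)]


# Assumed illustrative parameters: 2 is a root of order 8 modulo 17.
M, N, alpha = 17, 8, 2
x = [1, 2, 3, 4, 0, 0, 0, 0]
h = [1, 1, 1, 0, 0, 0, 0, 0]

X, H = ntt(x, alpha, M), ntt(h, alpha, M)
Y = [(a * b) % M for a, b in zip(X, H)]    # pointwise product of transforms
y = intt(Y, alpha, M)

print(y)                                   # convolution via the transform
print(cyclic_convolution(x, h, M))         # direct evaluation for comparison
```

All results are integers modulo M, so the convolution is exact only while the true output values stay inside the range that M can represent.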
Theorem 2 The transform (13.44) supports cyclic convolution if and only if 

$$N \mid O(M) \tag{13.45}$$

where 

$$O(M) = \gcd\{p_1 - 1,\ p_2 - 1,\ \cdots,\ p_l - 1\} \tag{13.46}$$

and 

$$M = p_1^{r_1} p_2^{r_2} \cdots p_l^{r_l}. \tag{13.47}$$


This theorem is a more useful form of Theorem 1. Notice that $N_{max} = O(M)$. 
One needs to find appropriate N, M, and $\alpha$ such that 


• N should be appropriate for a fast algorithm and handle the desired sequence lengths. 

• M should allow the desired dynamic range of the signals and should allow simple 
modular arithmetic. 

• $\alpha$ should allow a simple multiplication for $\alpha^{nk} x(n)$. 
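The quantity O(M) of Theorem 2 is easy to evaluate for small moduli. The sketch below is an illustration only: the helper names `prime_factors` and `max_length` are hypothetical, and trial division is assumed to be adequate because the moduli tried here are small.

```python
from functools import reduce
from math import gcd


def prime_factors(M):
    """Distinct prime factors of M by trial division (adequate for small M)."""
    factors, p = [], 2
    while p * p <= M:
        if M % p == 0:
            factors.append(p)
            while M % p == 0:
                M //= p
        p += 1
    if M > 1:
        factors.append(M)
    return factors


def max_length(M):
    """O(M): the gcd of (p - 1) over the distinct primes p dividing M."""
    return reduce(gcd, (p - 1 for p in prime_factors(M)))


for M in (10, 15, 17, 257):
    print(M, prime_factors(M), max_length(M))
```

An even modulus always contains the prime 2, which contributes 2 - 1 = 1 to the gcd, so only odd M can support a transform longer than one point.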


We see that if M is even, it has a factor of 2 and, therefore, $O(M) = N_{max} = 1$, which 
implies M should be odd. If M is prime, then $O(M) = M - 1$, which is as large as could be 
expected in a field of M integers. For $M = 2^k - 1$, let k be a composite $k = pq$ where p is 
prime. Then $2^p - 1$ divides $2^{pq} - 1$ and the maximum possible length of the transform will be 
governed by the length possible for $2^p - 1$. Therefore, only the prime k need be considered 
interesting. Numbers of this form are known as Mersenne numbers and have been used by 
Rader [311]. For Mersenne number transforms, it can be shown that transforms of length 
at least 2p exist and the corresponding $\alpha = -2$. Mersenne number transforms are not of as 
much interest because 2p is not highly composite and, therefore, we do not have FFT-type 
algorithms. 
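The statement about Mersenne moduli can be checked numerically. This is a sketch with an assumed helper `order` (a hypothetical name) that finds the multiplicative order by brute force, which is practical only for the small exponents tried here.

```python
def order(a, M):
    """Smallest N >= 1 with a^N = 1 (mod M); assumes gcd(a, M) = 1."""
    a %= M
    n, power = 1, a
    while power != 1:
        power = (power * a) % M
        n += 1
    return n


# For M = 2^p - 1 with p an odd prime, 2 has order p and -2 has order 2p.
for p in (3, 5, 7, 13):
    M = 2**p - 1
    print(p, M, order(2, M), order(-2, M))
```

Because 2p has only the factors 2 and p, a length-2p transform cannot be split into many short stages, which is the reason given above for the lack of FFT-type algorithms.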


For $M = 2^k + 1$ and k odd, 3 divides $2^k + 1$ and the maximum possible transform length is 
2. Thus we consider only even k. Let $k = s 2^t$, where s is an odd integer. Then $2^{2^t} + 1$ divides 
$2^{s 2^t} + 1$ and the length of the possible transform will be governed by the length possible for 
$2^{2^t} + 1$. Therefore, integers of the form $M = 2^{2^t} + 1$ are of interest. These numbers are 
known as Fermat numbers [311]. Fermat numbers are prime for $0 \le t \le 4$, and no Fermat 
number with $t \ge 5$ is known to be prime; every one whose status has been settled is composite. 


Since Fermat numbers up to $F_4$ are prime, $O(F_t) = 2^b$ where $b = 2^t$ and we can have a 
Fermat number transform for any length $N = 2^m$ where $m \le b$. For these Fermat primes 
(with $t \ge 1$) the integer $\alpha = 3$ is of order $N = 2^b$, allowing the largest possible transform length. The 
integer $\alpha = 2$ is of order $N = 2b = 2^{t+1}$. This is particularly attractive since $\alpha$ to a power is 
multiplied times the data values in (13.44), and multiplication by a power of 2 is only a bit shift. 
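The orders quoted for the Fermat primes can be confirmed by direct computation. The sketch below uses a hypothetical brute-force helper `order`; the range t = 1, ..., 4 is an assumption made so that 3 is not a multiple of the modulus (for t = 0 the modulus is 3 itself).

```python
def order(a, M):
    """Smallest N >= 1 with a^N = 1 (mod M); assumes gcd(a, M) = 1."""
    a %= M
    n, power = 1, a
    while power != 1:
        power = (power * a) % M
        n += 1
    return n


orders = {}
for t in range(1, 5):
    b = 2**t
    F = 2**b + 1                      # Fermat number F_t
    orders[t] = (order(2, F), order(3, F))
    print(t, F, orders[t], "expected", (2 * b, 2**b))
```

The order of 2 follows from 2^b being congruent to -1 modulo F_t, so squaring once more returns 1; the order of 3 is the full group size, which is why 3 gives the longest transform.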


The following table gives possible parameters for various Fermat number moduli. 
 t |  b | M = F_t  | N_2 | N_√2 | N_max | α for N_max
---+----+----------+-----+------+-------+------------
 3 |  8 | 2^8 + 1  |  16 |   32 |   256 | 3
 4 | 16 | 2^16 + 1 |  32 |   64 | 65536 | 3
 5 | 32 | 2^32 + 1 |  64 |  128 |   128 | √2
 6 | 64 | 2^64 + 1 | 128 |  256 |   256 | √2

Table 13.1 


This table gives values of N for the two most important values of α, which are 2 and √2. The 
second column gives the approximate number of bits in the number representation. The third 
column gives the Fermat number modulus, the fourth is the maximum convolution length 
for α = 2, the fifth is the maximum length for α = √2, the sixth is the maximum length 
for any α, and the seventh is the α for that maximum length. Remember that the first two 
rows have a Fermat number modulus which is prime and the last two rows have a composite 
Fermat number as modulus. Note the differences. 
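The lengths available for α = 2 and α = √2 can be reproduced numerically. The sketch below rests on an identity that is not stated in the text and is assumed here: modulo 2^b + 1, with b a multiple of 4, the integer 2^(b/4) (2^(b/2) - 1) squares to 2 and so can play the role of √2. The helper names `order` and `sqrt2_mod_fermat` are hypothetical.

```python
def order(a, M):
    """Smallest N >= 1 with a^N = 1 (mod M); assumes gcd(a, M) = 1."""
    a %= M
    n, power = 1, a
    while power != 1:
        power = (power * a) % M
        n += 1
    return n


def sqrt2_mod_fermat(b):
    """An integer whose square is 2 modulo 2^b + 1 (b a multiple of 4)."""
    F = 2**b + 1
    return (pow(2, b // 4, F) * (pow(2, b // 2, F) - 1)) % F


rows = []
for b in (8, 16, 32, 64):
    F = 2**b + 1
    r = sqrt2_mod_fermat(b)
    assert (r * r) % F == 2           # r really is a square root of 2
    rows.append((b, order(2, F), order(r, F)))

for row in rows:
    print(row)                        # (b, length for 2, length for sqrt 2)
```

The lengths come out as 2b and 4b for every row, prime or composite modulus alike, which is why α = √2 doubles the usable length while still keeping the multiplications close to simple shifts.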


The books, articles, and presentations that discuss NTT and related topics are [209], [237], 
[265], [31], [253], [257], [288], [312], [311], [1], [55], [2], [4]. A recent book discusses NT in a 
signal processing context [215]. 
Chapter 14 


Comments: Fast Fourier Transforms’ 


14.1 Other work and Results 


This section comes from a note describing results on efficient algorithms to calculate the 
discrete Fourier transform (DFT) that were collected over years. Perhaps the most interesting 
is the discovery that the Cooley-Tukey FFT was described by Gauss in 1805 [175]. That 
gives some indication of the age of research on the topic, and the fact that a 1995 compiled 
bibliography [363] on efficient algorithms contains over 3400 entries indicates its volume. 
Three IEEE Press reprint books contain papers on the FFT [303], [84], [85]. An excellent 
general purpose FFT program has been described in [132], [129] and is used in Matlab and 
available over the internet. 


In addition to this book there are several others [238], [266], [25], [170], [383], [254], [33], [37], 
[345] that give a good modern theoretical background for the FFT, one book [67] that gives 
the basic theory plus both FORTRAN and TMS 320 assembly language programs, and other 
books [219], [348], [70] that contain chapters on advanced FFT topics. A good up-to-date, 
on-line reference with both theory and programming techniques is in [11]. The history of the 
FFT is outlined in [87], [175] and excellent survey articles can be found in [115], [93]. The 
foundation of much of the modern work on efficient algorithms was done by S. Winograd. 
These results can be found in [412], [415], [418]. An outline and discussion of his theorems 
can be found in [219] as well as [238], [266], [25], [170]. 


Efficient FFT algorithms for length-2^M were described by Gauss and discovered in modern 
times by Cooley and Tukey [91]. These have been highly developed and good examples of 
FORTRAN programs can be found in [67]. Several new algorithms have been published 
that require the least known amount of total arithmetic [423], [108], [104], [229], [394], [71]. 
Of these, the split-radix FFT [108], [104], [392], [366] seems to have the best structure for 
programming, and an efficient program has been written [351] to implement it. A mixture of 
decimation-in-time and decimation-in-frequency with very good efficiency is given in [323], 
[324], and another, called the Sine-Cosine FT, in [71]. Recently a modification to the split-radix 
algorithm has been described [203] that has a slightly better total arithmetic count. Theoretical 
bounds on the number of multiplications required for the FFT based on Winograd’s theories 
are given in [170], [172]. Schemes for calculating an in-place, in-order radix-2 FFT are given 
in [17], [19], [196], [379]. Discussion of various forms of unscramblers is given in [51], [321], 
[186], [123], [318], [400], [424], [370], [315]. A discussion of the relation of the computer 
architecture, algorithm and compiler can be found in [251], [242]. A modification to allow 
lengths of N = q 2^m for q odd is given in [24]. 


The “other” FFT is the prime factor algorithm (PFA) which uses an index map originally 
developed by Thomas and by Good. The theory of the PFA was derived in [214] and further 
developed and an efficient in-order and in-place program given in [58], [67]. More results on 
the PFA are given in [377], [378], [379], [380], [364]. A method has been developed to use 
dynamic programming to design optimal FFT programs that minimize the number of addi- 
tions and data transfers as well as multiplications [191]. This new approach designs custom 
algorithms for a particular computer architecture. An efficient and practical development 
of Winograd’s ideas has given a design method that does not require the rather difficult 
Chinese remainder theorem [219], [199] for short prime length FFT’s. These ideas have been 
used to design modules of length 11, 13, 17, 19, and 25 [189]. Other methods for designing 
short DFT’s can be found in [376], [223]. A use of these ideas with distributed arithmetic 
and table look-up rather than multiplication is given in [80]. A program that implements 
the nested Winograd Fourier transform algorithm (WFTA) is given in [238] but it has not 
proven as fast or as versatile as the PFA [58]. An interesting use of the PFA was announced 
[75] in searching for large prime numbers. 


These efficient algorithms can not only be used on DFT’s but on other transforms with a 
similar structure. They have been applied to the discrete Hartley transform [354], [36] and 
the discrete cosine transform [394], [401], [314]. 


The fast Hartley transform has been proposed as a superior method for real data analysis 
but that has been shown not to be the case. A well-designed real-data FFT [360] is always 
as good as or better than a well-designed Hartley transform [354], [113], [289], [386], [371]. 
The Bruun algorithm [41], [369] also looks promising for real data applications as does the 
Rader-Brenner algorithm [310], [76], [386]. A novel approach to calculating the inverse DFT 
is given in [109]. 


General length algorithms include [340], [143], [125]. For lengths that are not highly 
composite or prime, the chirp z-transform is a good candidate [67], [307] for longer lengths, and 
an efficient order-N^2 algorithm called the QFT [343], [157], [160] is one for shorter lengths. A 
method which automatically generates near-optimal prime length Winograd based programs 
has been given in [199], [330], [332], [334], [336]. This gives the same efficiency for shorter 
lengths (i.e. N < 19) and new algorithms for much longer lengths and with well-structured 
algorithms. Another approach is given in [285]. Special methods are available for very long 
lengths [183], [365]. A very interesting general length FFT system called the FFTW has 
been developed by Frigo and Johnson at MIT. It uses a library of efficient "codelets" which 
are composed for a very efficient calculation of the DFT on a wide variety of computers 
[132], [129], [136]. For most lengths and on most computers, this was the fastest FFT at the time of writing. 
Surprisingly, it uses a recursive program structure. The FFTW won the 1999 Wilkinson 
Prize for Numerical Software. 


The use of the FFT to calculate discrete convolution was one of its earliest uses. Although 
the more direct rectangular transform [9] would seem to be more efficient, use of the FFT 
or PFA is still probably the fastest method on a general purpose computer or DSP chip 
[287], [360], [113], [241]. On special purpose hardware or special architectures, the use of 
distributed arithmetic [80] or number theoretic transforms [5] may be even faster. Special 
algorithms for use with the short-time Fourier transform [346] and for the calculation of a few 
DFT values [349], [316], [347] and for recursive implementation [399], [129] have also been 
developed. An excellent analysis of efficient programming of the FFT on DSP microprocessors 
is given in [243], [242]. Formulations of the DFT in terms of tensor or Kronecker products 
look promising for developing algorithms for parallel and vector computer architectures [361], 
[383], [200], [390], [385], [154], [153]. 


Various approaches to calculating approximate DFTs have been based on cordic methods, 
short word lengths, or some form of pruning. A new method that uses the characteristics of 
the signals being transformed has combined the discrete wavelet transform (DWT) 
with the DFT to give an approximate FFT with O(N) multiplications [162], [164], [69] for 
certain signal classes. A similar approach has been developed using filter banks [339], [185]. 


The study of efficient algorithms not only has a long history and large bibliography, it is still 
an exciting research field where new results are used in practical applications. 


More information can be found on the Rice DSP Group’s web page (http://www-dsp.rice.edu). 
Chapter 15 


Conclusions: Fast Fourier Transforms’ 


This book has developed a class of efficient algorithms based on index mapping and polyno- 
mial algebra. This provides a framework from which the Cooley-Tukey FFT, the split-radix 
FFT, the PFA, and WFTA can be derived. Even the programs implementing these algo- 
rithms can have a similar structure. Winograd’s theorems were presented and shown to be 
very powerful in both deriving algorithms and in evaluating them. The simple radix-2 FFT 
provides a compact, elegant means for efficiently calculating the DFT. If some elaboration 
is allowed, significant improvement can be had from the split-radix FFT, the radix-4 FFT 
or the PFA. If multiplications are expensive, the WFTA requires the least of all. 


Several methods for transforming real data were described that are more efficient than directly 
using a complex FFT. A complex FFT can be used for real data by artificially creating a 
complex input from two sections of real input. An alternative and slightly more efficient 
method is to construct a special FFT that utilizes the symmetries at each stage. 
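One common way to form a complex input from two real sections is to place one section in the real part and the other in the imaginary part, then separate the two spectra using conjugate symmetry. The sketch below assumes that arrangement; the function names `dft` and `two_real_dfts` are hypothetical, and a direct DFT stands in for whatever complex FFT would be used in practice.

```python
import cmath


def dft(z):
    """Direct DFT, used here only as a stand-in for a complex FFT."""
    N = len(z)
    return [sum(z[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def two_real_dfts(x1, x2):
    """DFTs of two real sequences from one complex transform."""
    N = len(x1)
    Z = dft([complex(a, b) for a, b in zip(x1, x2)])
    # For real data X(N-k) = conj(X(k)), which separates the two spectra.
    X1 = [(Z[k] + Z[-k % N].conjugate()) / 2 for k in range(N)]
    X2 = [(Z[k] - Z[-k % N].conjugate()) / 2j for k in range(N)]
    return X1, X2


x1 = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 2.0, 5.0]
x2 = [0.5, 0.0, -2.0, 1.0, 3.0, 3.0, -1.0, 2.0]
X1, X2 = two_real_dfts(x1, x2)
print(max(abs(a - b) for a, b in zip(X1, dft(x1))))   # should be near zero
print(max(abs(a - b) for a, b in zip(X2, dft(x2))))   # should be near zero
```

The separation step costs only a few additions per output, so two real transforms are obtained for roughly the price of one complex transform.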


As computers move to multiprocessors and multicore, writing and maintaining efficient pro- 
grams becomes more and more difficult. The highly structured form of FFTs allows auto- 
matic generation of very efficient programs that are tailored specifically to a particular DSP 
or computer architecture. 


For high-speed convolution, the traditional use of the FFT or PFA with blocking is proba- 
bly the fastest method although rectangular transforms, distributed arithmetic, or number 
theoretic transforms may have a future with special VLSI hardware. 


The ideas presented in these notes can also be applied to the calculation of the discrete 
Hartley transform [355], [112], the discrete cosine transform [119], [395], and to number 
theoretic transforms [32], [239], [267]. 


There are many areas for future research. The relationship of hardware to algorithms, the 
proper use of multiple processors, the proper design and use of array processors and vector 
processors are all open. There are still many unanswered questions in multi-dimensional 
algorithms where a simple extension of one-dimensional methods will not suffice. 
Appendix 1: FFT Flowgraphs 


16.1 Signal Flow Graphs of Cooley-Tukey FFTs 


The following four figures are flow graphs for Radix-2 Cooley-Tukey FFTs. The first is a 
length-16, decimation-in-frequency Radix-2 FFT with the input data in order and output 
data scrambled. The first stage has 8 length-2 "butterflies" (which overlap in the figure) 
followed by 8 multiplications by powers of W which are called "twiddle factors". The second 
stage has 2 length-8 FFTs which are each calculated by 4 butterflies followed by 4 multiplies. 
The third stage has 4 length-4 FFTs, each calculated by 2 butterflies followed by 2 multiplies 
and the last stage is simply 8 butterflies followed by trivial multiplies by one. This flow graph 
should be compared with the index map in Polynomial Description of Signals (Chapter 4), 
the polynomial decomposition in The DFT as Convolution or Filtering (Chapter 5), and the 
program in Appendix 3. In the program, the butterflies and twiddle factor multiplications 
are done together in the innermost loop. The outermost loop indexes through the stages. 
If the length of the FFT is a power of two, the number of stages is that power (log_2 N). 
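The stage, twiddle-factor, and butterfly loops described above can be sketched compactly. This is an illustrative Python sketch, not the program of Appendix 3: the function name `fft_dif_radix2` is hypothetical, complex numbers replace the separate real and imaginary arrays, and the unscrambler is written with string reversal for brevity rather than with a counter.

```python
import cmath


def fft_dif_radix2(x):
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    x = list(x)
    N = len(x)
    M = N.bit_length() - 1                  # number of stages, log2(N)
    n2 = N
    for _ in range(M):                      # outermost loop: stages
        n1, n2 = n2, n2 // 2
        for j in range(n2):                 # one pass per distinct twiddle factor
            w = cmath.exp(-2j * cmath.pi * j / n1)
            for i in range(j, N, n1):       # innermost loop: butterflies sharing w
                l = i + n2
                t = x[i] - x[l]
                x[i] = x[i] + x[l]
                x[l] = t * w                # twiddle applied after the butterfly
    out = [0j] * N                          # bit-reversing unscrambler
    for i in range(N):
        r = int(format(i, "0{}b".format(M))[::-1], 2) if M else 0
        out[r] = x[i]
    return out


data = [complex(n, (n * n) % 5) for n in range(16)]
result = fft_dif_radix2(data)
print(result[0])                            # equals the sum of the input samples
```

For a length-16 input the outer loop runs four times, and in the last stage the only twiddle factor is one, matching the trivial multiplies in the final stage of the flow graph.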


The second figure below is a length-16, decimation-in-time FFT with the input data 
scrambled and output data in order. The first stage has 8 length-2 "butterflies" followed by 8 
twiddle factor multiplications. The second stage has 4 length-4 FFTs which are each 
calculated by 2 butterflies followed by 2 multiplies. The third stage has 2 length-8 FFTs, each 
calculated by 4 butterflies followed by 8 multiplies and the last stage is simply 8 length-2 
butterflies. This flow graph should be compared with the index map in Polynomial Descrip- 
tion of Signals (Chapter 4), the polynomial decomposition in The DFT as Convolution or 
Filtering (Chapter 5), and the program in Appendix 3 (Chapter 18). Here, the FFT must 
be preceded by a scrambler. 


The third and fourth figures below are a length-16 decimation-in-frequency FFT and a 
decimation-in-time FFT but, in contrast to the figures above, the DIF has the output in order, which requires 
a scrambled input, and the DIT has the input in order, which requires the output to be 
unscrambled. Compare with the first two figures. Note the order of the twiddle factors. The number 
of additions and multiplications in all four flow graphs is the same and the structure of the 
three-loop program which executes the flow graph is the same. 
Figure 16.1: Length-16, Decimation-in-Frequency, In-order input, Radix-2 FFT 
Figure 16.2: Length-16, Decimation-in-Time, In-order output, Radix-2 FFT 
Figure 16.3: Length-16, alternate Decimation-in-Frequency, In-order output, Radix-2 FFT 
Figure 16.4: Length-16, alternate Decimation-in-Time, In-order input, Radix-2 FFT 





The following is a length-16, decimation-in-frequency Radix-4 FFT with the input data in 
order and output data scrambled. There are two stages with the first stage having 4 length-4 
"butterflies" followed by 12 multiplications by powers of W which are called "twiddle 
factors". The second stage has 4 length-4 FFTs which are each calculated by 4 butterflies 
followed by 4 multiplies. Note, each stage here looks like two stages but it is one and 
there is only one place where twiddle factor multiplications appear. This flow graph should 
be compared with the index map in Polynomial Description of Signals (Chapter 4), the 
polynomial decomposition in The DFT as Convolution or Filtering (Chapter 5), and the 
program in Appendix 3 (Chapter 18). Log to the base 4 of 16 is 2. The total number of 
twiddle factor multiplications here is 12 compared to 24 for the radix-2. The unscrambler is 
a base-four reverse order counter rather than a bit reverse counter; however, a modification 
of the radix-four butterflies will allow a bit reverse counter to be used with the radix-4 FFT 
as with the radix-2. 
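The difference between the two unscramblers can be seen by listing both index orders for length 16. This is a small sketch with a hypothetical helper `digit_reverse`; it reverses the digits of each index arithmetically instead of using the counter found in the programs.

```python
def digit_reverse(i, base, digits):
    """Reverse the base-`base` digits of i, treating i as `digits` digits long."""
    r = 0
    for _ in range(digits):
        i, d = divmod(i, base)      # peel off the lowest digit
        r = r * base + d            # and push it onto the reversed number
    return r


base4 = [digit_reverse(i, 4, 2) for i in range(16)]   # radix-4 output order
base2 = [digit_reverse(i, 2, 4) for i in range(16)]   # radix-2 output order
print(base4)
print(base2)
```

The two orders differ because reversing base-four digits keeps each pair of bits together, while bit reversal also swaps the bits inside every pair.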
Figure 16.5: Length-16, Decimation-in-Frequency, In-order input, Radix-4 FFT 





The following two flowgraphs are length-16, decimation-in-frequency Split Radix FFTs with 
the input data in order and output data scrambled. Because the "butterflies" are L shaped, 
the stages do not progress uniformly like the Radix-2 or 4. These two figures are the same 
with the first drawn in a way to compare with the Radix-2 and 4, and the second to illustrate 
the L shaped butterflies. These flow graphs should be compared with the index map in 
Polynomial Description of Signals (Chapter 4) and the program in Appendix 3 (Chapter 18). 
Because of the non-uniform stages, the program indexing is more complicated. Although 
the number of twiddle factor multiplications is 12 as was the radix-4 case, for longer lengths, 
the split-radix has slightly fewer multiplications than the radix-4. 


Because the structures of the radix-2, radix-4, and split-radix FFTs are the same, the number 
of data additions is the same for all of them. However, because each complex twiddle factor multiplication 
requires two real additions (and four real multiplications), the total number of additions will be 
fewer for the structures with fewer multiplications. 
Figure 16.6: Length-16, Decimation-in-Frequency, In-order input, Split-Radix FFT 
Figure 16.7: Length-16, Decimation-in-Frequency, Split-Radix with special BFs FFT 
Appendix 2: Operation Counts for General Length FFT 


17.1 Figures 


The Glassman-Ferguson FFT is a compact implementation of a mixed-radix Cooley-Tukey 
FFT with the short DFTs for each factor being calculated by a Goertzel-like algorithm. This 
means there are twiddle factor multiplications even when the factors are relatively prime, 
however, the indexing is simple and compact. It will calculate the DFT of a sequence of any 
length but is efficient only if the length is highly composite. The figures contain plots of 
the number of floating point multiplications plus additions vs. the length of the FFT. The 
numbers on the vertical axis have relative meaning but no absolute meaning. 




















Figure 17.1: Flop-Count vs Length for the Glassman-Ferguson FFT 





Note the parabolic shape of the curve for certain values. The upper curve is for prime lengths, 
the next one is for lengths that are two times a prime, and the next one is for lengths that 
are three times a prime, etc. The shape of the lower boundary is roughly N log N. The 
program that generated these two figures used a Cooley-Tukey FFT if the length is two to 
a power, which accounts for the points that are below the major lower boundary. 
Figure 17.2: Flop-Count vs Length for the Glassman-Ferguson FFT 
Appendix 3: FFT Computer Programs 


18.1 Goertzel Algorithm 


A FORTRAN implementation of the first-order Goertzel algorithm with in-order input as 
given in () and [68] is given below. 
C----------------------------------------------
C   GOERTZEL'S DFT ALGORITHM
C   First order, input inorder
C   C. S. BURRUS, SEPT 1983
C----------------------------------------------
      SUBROUTINE DFT(X,Y,A,B,N)
      REAL X(260), Y(260), A(260), B(260)
      Q = 6.283185307179586/N
      DO 20 J = 1, N
          C  = COS(Q*(J-1))
          S  = SIN(Q*(J-1))
          AT = X(1)
          BT = Y(1)
          DO 30 I = 2, N
              T  = C*AT - S*BT + X(I)
              BT = C*BT + S*AT + Y(I)
              AT = T
   30     CONTINUE
          A(J) = C*AT - S*BT
          B(J) = C*BT + S*AT
   20 CONTINUE
      RETURN
      END


Listing 18.1: First Order Goertzel Algorithm 
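For readers who want to experiment with the first-order recursion, the following Python sketch mirrors the listing's structure. It is a port made for illustration, not part of the original programs: the function name `goertzel_dft` is hypothetical, indices are zero-based, and the output arrays are returned instead of being passed in.

```python
import math


def goertzel_dft(x, y):
    """First-order Goertzel DFT of the complex data x + jy, one output at a time."""
    N = len(x)
    q = 2 * math.pi / N
    a, b = [0.0] * N, [0.0] * N
    for j in range(N):
        c, s = math.cos(q * j), math.sin(q * j)
        at, bt = x[0], y[0]
        for i in range(1, N):
            # Horner step: multiply the running value by (c + js), add next sample.
            at, bt = c * at - s * bt + x[i], c * bt + s * at + y[i]
        a[j] = c * at - s * bt              # final multiplication by (c + js)
        b[j] = c * bt + s * at
    return a, b


re, im = goertzel_dft([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0])
print(re)
print(im)
```

Each output costs on the order of N complex multiplications, so computing all N outputs this way is an order-N^2 method that is mainly of interest when only a few DFT values are needed.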





18.2 Second Order Goertzel Algorithm 


Below is the program for a second order Goertzel algorithm. 
C----------------------------------------------
C   GOERTZEL'S DFT ALGORITHM
C   Second order, input inorder
C   C. S. BURRUS, SEPT 1983
C----------------------------------------------
      SUBROUTINE DFT(X,Y,A,B,N)
      REAL X(260), Y(260), A(260), B(260)
C
      Q = 6.283185307179586/N
      DO 20 J = 1, N
          C  = COS(Q*(J-1))
          S  = SIN(Q*(J-1))
          CC = 2*C
          A2 = 0
          B2 = 0
          A1 = X(1)
          B1 = Y(1)
          DO 30 I = 2, N
              T  = A1
              A1 = CC*A1 - A2 + X(I)
              A2 = T
              T  = B1
              B1 = CC*B1 - B2 + Y(I)
              B2 = T
   30     CONTINUE
          A(J) = C*A1 - A2 - S*B1
          B(J) = C*B1 - B2 + S*A1
   20 CONTINUE
C
      RETURN
      END


Listing 18.2: Second Order Goertzel Algorithm 
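The second-order form can be tried out in the same way. The Python sketch below is an illustrative port with a hypothetical function name `goertzel2_dft`; it keeps the listing's variable roles but uses zero-based indexing and tuple assignment in place of the temporary variable.

```python
import math


def goertzel2_dft(x, y):
    """Second-order Goertzel DFT of the complex data x + jy."""
    N = len(x)
    q = 2 * math.pi / N
    a, b = [0.0] * N, [0.0] * N
    for j in range(N):
        c, s = math.cos(q * j), math.sin(q * j)
        cc = 2 * c
        a1, a2 = x[0], 0.0
        b1, b2 = y[0], 0.0
        for i in range(1, N):
            # Real-coefficient recursion: one multiply by 2cos per real sequence.
            a1, a2 = cc * a1 - a2 + x[i], a1
            b1, b2 = cc * b1 - b2 + y[i], b1
        # A single complex correction turns the two states into the DFT value.
        a[j] = c * a1 - a2 - s * b1
        b[j] = c * b1 - b2 + s * a1
    return a, b


re2, im2 = goertzel2_dft([1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0])
print(re2)
print(im2)
```

Compared with the first-order version, the inner loop needs one real multiplication per real sample instead of a complex multiplication, which is the main appeal of the second-order form.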





18.3 Second Order Goertzel Algorithm 2 


Second order Goertzel algorithm that calculates two outputs at a time. 
C----------------------------------------------
C   GOERTZEL'S DFT ALGORITHM, Second order
C   Input inorder, output by twos; C.S. Burrus, SEPT 1991
C----------------------------------------------
      SUBROUTINE DFT(X,Y,A,B,N)
      REAL X(260), Y(260), A(260), B(260)
      Q = 6.283185307179586/N
      DO 20 J = 1, N/2 + 1
          C  = COS(Q*(J-1))
          S  = SIN(Q*(J-1))
          CC = 2*C
          A2 = 0
          B2 = 0
          A1 = X(1)
          B1 = Y(1)
          DO 30 I = 2, N
              T  = A1
              A1 = CC*A1 - A2 + X(I)
              A2 = T
              T  = B1
              B1 = CC*B1 - B2 + Y(I)
              B2 = T
   30     CONTINUE
          A2 = C*A1 - A2
          T  = S*B1
          A(J)     = A2 - T
          A(N-J+2) = A2 + T
          B2 = C*B1 - B2
          T  = S*A1
          B(J)     = B2 + T
          B(N-J+2) = B2 - T
   20 CONTINUE
      RETURN
      END


Figure. Second Order Goertzel Calculating Two Outputs at a Time 


18.4 Basic QFT Algorithm 


A FORTRAN implementation of the basic QFT algorithm is given below to show how the 
theory is implemented. The program is written for clarity, not to minimize the number of 
floating point operations. 
C
      SUBROUTINE QDFT(X,Y,XX,YY,NN)
      REAL X(0:260),Y(0:260),XX(0:260),YY(0:260)
C
      N1  = NN - 1
      N2  = N1/2
      N21 = NN/2
      Q   = 6.283185308/NN
      DO 2 K = 0, N21
          SSX = X(0)
          SSY = Y(0)
          SDX = 0
          SDY = 0
C         MIDDLE SAMPLE OF AN EVEN LENGTH, WEIGHTED BY COS(PI*K)
          IF (MOD(NN,2).EQ.0) THEN
              SSX = SSX + COS(3.141592654*K)*X(N21)
              SSY = SSY + COS(3.141592654*K)*Y(N21)
          ENDIF
          DO 3 N = 1, N2
              SSX = SSX + (X(N) + X(NN-N))*COS(Q*N*K)
              SSY = SSY + (Y(N) + Y(NN-N))*COS(Q*N*K)
              SDX = SDX + (X(N) - X(NN-N))*SIN(Q*N*K)
              SDY = SDY + (Y(N) - Y(NN-N))*SIN(Q*N*K)
    3     CONTINUE
          XX(K)    = SSX + SDY
          YY(K)    = SSY - SDX
          XX(NN-K) = SSX - SDY
          YY(NN-K) = SSY + SDX
    2 CONTINUE
      RETURN
      END


Listing 18.3: Simple QFT Fortran Program 
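The way the QFT pairs samples n and N - n, and outputs k and N - k, may be easier to follow in a short Python sketch. This is an illustrative port under stated assumptions: the function name `qft` is hypothetical, the weight cos(pi k) on the middle sample is replaced by an exact sign, and the index N - k is wrapped modulo N so that no element beyond the transform length is written.

```python
import math


def qft(x, y):
    """Basic QFT: DFT of x + jy from symmetric and antisymmetric sample sums."""
    NN = len(x)
    n2 = (NN - 1) // 2
    n21 = NN // 2
    q = 2 * math.pi / NN
    xx, yy = [0.0] * NN, [0.0] * NN
    for k in range(n21 + 1):
        ssx, ssy, sdx, sdy = x[0], y[0], 0.0, 0.0
        if NN % 2 == 0:
            sign = -1.0 if k % 2 else 1.0       # exact value of cos(pi k)
            ssx += sign * x[n21]
            ssy += sign * y[n21]
        for n in range(1, n2 + 1):
            c, s = math.cos(q * n * k), math.sin(q * n * k)
            ssx += (x[n] + x[NN - n]) * c       # even parts use cosines
            ssy += (y[n] + y[NN - n]) * c
            sdx += (x[n] - x[NN - n]) * s       # odd parts use sines
            sdy += (y[n] - y[NN - n]) * s
        xx[k], yy[k] = ssx + sdy, ssy - sdx
        xx[(NN - k) % NN], yy[(NN - k) % NN] = ssx - sdy, ssy + sdx
    return xx, yy


qx, qy = qft([1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 0.0, -1.0, 2.0])
print(qx)
print(qy)
```

Each pass of the loops produces two outputs from one set of cosine and sine sums, which is where the saving over a direct DFT comes from.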
18.5 Basic Radix-2 FFT Algorithm 


Below is the Fortran code for a simple Decimation-in-Frequency, Radix-2, one butterfly 
Cooley-Tukey FFT followed by a bit-reversing unscrambler. 


C
C   A COOLEY-TUKEY RADIX-2, DIF FFT PROGRAM
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   C. S. BURRUS, RICE UNIVERSITY, SEPT 1983
C---------------------------------------------------------
      SUBROUTINE FFT (X,Y,N,M)
      REAL X(1), Y(1)
C--------------MAIN FFT LOOPS-----------------------------
C
      N2 = N
      DO 10 K = 1, M
          N1 = N2
          N2 = N2/2
          E  = 6.283185307179586/N1
          A  = 0
          DO 20 J = 1, N2
              C = COS (A)
              S = SIN (A)
              A = J*E
              DO 30 I = J, N, N1
                  L = I + N2
                  XT   = X(I) - X(L)
                  X(I) = X(I) + X(L)
                  YT   = Y(I) - Y(L)
                  Y(I) = Y(I) + Y(L)
                  X(L) = C*XT + S*YT
                  Y(L) = C*YT - S*XT
   30         CONTINUE
   20     CONTINUE
   10 CONTINUE
C
C------------DIGIT REVERSE COUNTER-----------------
  100 J = 1
      N1 = N - 1
      DO 104 I = 1, N1
          IF (I.GE.J) GOTO 101
          XT   = X(J)
          X(J) = X(I)
          X(I) = XT
          XT   = Y(J)
          Y(J) = Y(I)
          Y(I) = XT
  101     K = N/2
  102     IF (K.GE.J) GOTO 103
              J = J - K
              K = K/2
              GOTO 102
  103     J = J + K
  104 CONTINUE
      RETURN
      END


Figure: Radix-2, DIF, One Butterfly Cooley-Tukey FFT 


18.6 Basic DIT Radix-2 FFT Algorithm 


Below is the Fortran code for a simple Decimation-in-Time, Radix-2, one butterfly Cooley- 
Tukey FFT preceded by a bit-reversing scrambler. 


C
C   A COOLEY-TUKEY RADIX-2, DIT FFT PROGRAM
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   C. S. BURRUS, RICE UNIVERSITY, SEPT 1985
C
C---------------------------------------------------------
      SUBROUTINE FFT (X,Y,N,M)
      REAL X(1), Y(1)
C------------DIGIT REVERSE COUNTER-----------------
C
  100 J = 1
      N1 = N - 1
      DO 104 I = 1, N1
          IF (I.GE.J) GOTO 101
          XT   = X(J)
          X(J) = X(I)
          X(I) = XT
          XT   = Y(J)
          Y(J) = Y(I)
          Y(I) = XT
  101     K = N/2
  102     IF (K.GE.J) GOTO 103
              J = J - K
              K = K/2
              GOTO 102
  103     J = J + K
  104 CONTINUE
C--------------MAIN FFT LOOPS-----------------------------
C
      N2 = 1
      DO 10 K = 1, M
          E = 6.283185307179586/(2*N2)
          A = 0
          DO 20 J = 1, N2
              C = COS (A)
              S = SIN (A)
              A = J*E
              DO 30 I = J, N, 2*N2
                  L = I + N2
                  XT   = C*X(L) + S*Y(L)
                  YT   = C*Y(L) - S*X(L)
                  X(L) = X(I) - XT
                  X(I) = X(I) + XT
                  Y(L) = Y(I) - YT
                  Y(I) = Y(I) + YT
   30         CONTINUE
   20     CONTINUE
          N2 = N2 + N2
   10 CONTINUE
C
      RETURN
      END


18.7 DIF Radix-2 FFT Algorithm 


Below is the Fortran code for a Decimation-in-Frequency, Radix-2, three butterfly Cooley- 
Tukey FFT followed by a bit-reversing unscrambler. 


C   A COOLEY-TUKEY RADIX 2, DIF FFT PROGRAM
C   THREE-BF, MULT BY 1 AND J ARE REMOVED
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   TABLE LOOK-UP OF W VALUES
C   C. S. BURRUS, RICE UNIVERSITY, SEPT 1983
C---------------------------------------------------------
      SUBROUTINE FFT (X,Y,N,M,WR,WI)
      REAL X(1), Y(1), WR(1), WI(1)
C--------------MAIN FFT LOOPS-----------------------------
C
      N2 = N
      DO 10 K = 1, M
          N1 = N2
          N2 = N2/2
          JT = N2/2 + 1
C         SPECIAL BUTTERFLY FOR W = 1
          DO 1 I = 1, N, N1
              L = I + N2
              T    = X(I) - X(L)
              X(I) = X(I) + X(L)
              X(L) = T
              T    = Y(I) - Y(L)
              Y(I) = Y(I) + Y(L)
              Y(L) = T
    1     CONTINUE
          IF (K.EQ.M) GOTO 10
          IE = N/N1
          IA = 1
          DO 20 J = 2, N2
              IA = IA + IE
              IF (J.EQ.JT) GOTO 50
              C = WR(IA)
              S = WI(IA)
              DO 30 I = J, N, N1
                  L = I + N2
                  T    = X(I) - X(L)
                  X(I) = X(I) + X(L)
                  TY   = Y(I) - Y(L)
                  Y(I) = Y(I) + Y(L)
                  X(L) = C*T + S*TY
                  Y(L) = C*TY - S*T
   30         CONTINUE
              GOTO 25
C             SPECIAL BUTTERFLY FOR W = J
   50         DO 40 I = J, N, N1
                  L = I + N2
                  T    = X(I) - X(L)
                  X(I) = X(I) + X(L)
                  TY   = Y(I) - Y(L)
                  Y(I) = Y(I) + Y(L)
                  X(L) = TY
                  Y(L) = -T
   40         CONTINUE
   25         CONTINUE
   20     CONTINUE
   10 CONTINUE
C------------DIGIT REVERSE COUNTER Goes here----------
      RETURN
      END


18.8 Basic DIF Radix-4 FFT Algorithm 


Below is the Fortran code for a simple Decimation-in-Frequency, Radix-4, one butterfly 
Cooley-Tukey FFT to be followed by an unscrambler. 


C   A COOLEY-TUKEY RADIX-4 DIF FFT PROGRAM
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   LENGTH IS N = 4 ** M
C   C. S. BURRUS, RICE UNIVERSITY, SEPT 1983
C---------------------------------------------------------
      SUBROUTINE FFT4 (X,Y,N,M)
      REAL X(1), Y(1)
C--------------MAIN FFT LOOPS-----------------------------
      N2 = N
      DO 10 K = 1, M
          N1 = N2
          N2 = N2/4
          E  = 6.283185307179586/N1
          A  = 0
          DO 20 J = 1, N2
              B   = A + A
              C   = A + B
              CO1 = COS(A)
              CO2 = COS(B)
              CO3 = COS(C)
              SI1 = SIN(A)
              SI2 = SIN(B)
              SI3 = SIN(C)
              A   = J*E
C----------------BUTTERFLIES WITH SAME W---------------
              DO 30 I = J, N, N1
                  I1 = I  + N2
                  I2 = I1 + N2
                  I3 = I2 + N2
                  R1 = X(I ) + X(I2)
                  R3 = X(I ) - X(I2)
                  S1 = Y(I ) + Y(I2)
                  S3 = Y(I ) - Y(I2)
                  R2 = X(I1) + X(I3)
                  R4 = X(I1) - X(I3)
                  S2 = Y(I1) + Y(I3)
                  S4 = Y(I1) - Y(I3)
                  X(I) = R1 + R2
                  R2   = R1 - R2
                  R1   = R3 - S4
                  R3   = R3 + S4
                  Y(I) = S1 + S2
                  S2   = S1 - S2
                  S1   = S3 + R4
                  S3   = S3 - R4
                  X(I1) = CO1*R3 + SI1*S3
                  Y(I1) = CO1*S3 - SI1*R3
                  X(I2) = CO2*R2 + SI2*S2
                  Y(I2) = CO2*S2 - SI2*R2
                  X(I3) = CO3*R1 + SI3*S1
                  Y(I3) = CO3*S1 - SI3*R1
   30         CONTINUE
   20     CONTINUE
   10 CONTINUE
C-----------DIGIT REVERSE COUNTER goes here-----
      RETURN
      END


18.9 Basic DIF Radix-4 FFT Algorithm 




Below is the Fortran code for a Decimation-in-Frequency, Radix-4, three butterfly Cooley- 
Tukey FFT followed by a bit-reversing unscrambler. Twiddle factors are precalculated and 
stored in arrays WR and WI. 


C
C   A COOLEY-TUKEY RADIX-4 DIF FFT PROGRAM
C   THREE BF, MULTIPLICATIONS BY 1, J, ETC. ARE REMOVED
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   LENGTH IS N = 4 ** M
C   TABLE LOOKUP OF W VALUES
C      C. S. BURRUS, RICE UNIVERSITY, SEPT 1983
C
      SUBROUTINE FFT4 (X,Y,N,M,WR,WI)
      REAL X(1), Y(1), WR(1), WI(1)
      DATA C21 / 0.707106778 /
C
      N2 = N
      DO 10 K = 1, M
         N1 = N2
         N2 = N2/4
         JT = N2/2 + 1
C---------------SPECIAL BUTTERFLY FOR W = 1---------------
         DO 1 I = 1, N, N1
            I1 = I  + N2
            I2 = I1 + N2
            I3 = I2 + N2
            R1 = X(I ) + X(I2)
            R3 = X(I ) - X(I2)
            S1 = Y(I ) + Y(I2)
            S3 = Y(I ) - Y(I2)
            R2 = X(I1) + X(I3)
            R4 = X(I1) - X(I3)
            S2 = Y(I1) + Y(I3)
            S4 = Y(I1) - Y(I3)
C
            X(I)  = R1 + R2
            X(I2) = R1 - R2
            X(I3) = R3 - S4
            X(I1) = R3 + S4
C
            Y(I)  = S1 + S2
            Y(I2) = S1 - S2
            Y(I3) = S3 + R4
            Y(I1) = S3 - R4
   1     CONTINUE
         IF (K.EQ.M) GOTO 10
         IE  = N/N1
         IA1 = 1
         DO 20 J = 2, N2
            IA1 = IA1 + IE
            IF (J.EQ.JT) GOTO 50
            IA2 = IA1 + IA1 - 1
            IA3 = IA2 + IA1 - 1
            CO1 = WR(IA1)
            CO2 = WR(IA2)
            CO3 = WR(IA3)
            SI1 = WI(IA1)
            SI2 = WI(IA2)
            SI3 = WI(IA3)
C---------------BUTTERFLIES WITH SAME W-------------------
            DO 30 I = J, N, N1
               I1 = I  + N2
               I2 = I1 + N2
               I3 = I2 + N2
               R1 = X(I ) + X(I2)
               R3 = X(I ) - X(I2)
               S1 = Y(I ) + Y(I2)
               S3 = Y(I ) - Y(I2)
               R2 = X(I1) + X(I3)
               R4 = X(I1) - X(I3)
               S2 = Y(I1) + Y(I3)
               S4 = Y(I1) - Y(I3)
C
               X(I) = R1 + R2
               R2   = R1 - R2
               R1   = R3 - S4
               R3   = R3 + S4
C
               Y(I) = S1 + S2
               S2   = S1 - S2
               S1   = S3 + R4
               S3   = S3 - R4
C
               X(I1) = CO1*R3 + SI1*S3
               Y(I1) = CO1*S3 - SI1*R3
               X(I2) = CO2*R2 + SI2*S2
               Y(I2) = CO2*S2 - SI2*R2
               X(I3) = CO3*R1 + SI3*S1
               Y(I3) = CO3*S1 - SI3*R1
  30        CONTINUE
            GOTO 20
C---------------SPECIAL BUTTERFLY FOR J = JT--------------
  50        DO 40 I = J, N, N1
               I1 = I  + N2
               I2 = I1 + N2
               I3 = I2 + N2
               R1 = X(I ) + X(I2)
               R3 = X(I ) - X(I2)
               S1 = Y(I ) + Y(I2)
               S3 = Y(I ) - Y(I2)
               R2 = X(I1) + X(I3)
               R4 = X(I1) - X(I3)
               S2 = Y(I1) + Y(I3)
               S4 = Y(I1) - Y(I3)
C
               X(I)  = R1 + R2
               Y(I2) =-R1 + R2
               R1    = R3 - S4
               R3    = R3 + S4
C
               Y(I)  = S1 + S2
               X(I2) = S1 - S2
               S1    = S3 + R4
               S3    = S3 - R4
C
               X(I1) = (S3 + R3)*C21
               Y(I1) = (S3 - R3)*C21
               X(I3) = (S1 - R1)*C21
               Y(I3) =-(S1 + R1)*C21
  40        CONTINUE
  20     CONTINUE
  10  CONTINUE
C---------------DIGIT REVERSE COUNTER---------------------
      J = 1
      N1 = N - 1
      DO 104 I = 1, N1
         IF (I.GE.J) GOTO 101
         R1   = X(J)
         X(J) = X(I)
         X(I) = R1
         R1   = Y(J)
         Y(J) = Y(I)
         Y(I) = R1
 101     K = N/4
 102     IF (K*3.GE.J) GOTO 103
         J = J - K*3
         K = K/4
         GOTO 102
 103     J = J + K
 104  CONTINUE
      RETURN
      END
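
The listing takes WR and WI as arguments but does not show how they are filled. The following Python sketch gives one reading of the table layout, inferred from the index arithmetic (IA1, IA2, IA3) and from the butterflies, which multiply by WR - j*WI. The special J = JT butterfly hard-codes C21*(1 - j), which is consistent with WI holding +sin of the angle. The names `twiddle_tables` and `rotate` are hypothetical, Python is used only for illustration, and the 0-based entry k below is assumed to correspond to the Fortran entry k+1.

```python
import math

def twiddle_tables(n):
    """Return (wr, wi) with wr[k] = cos(2*pi*k/n) and wi[k] = sin(2*pi*k/n).

    Assumed layout: Fortran WR(k+1) and WI(k+1) hold entry k of these lists.
    """
    step = 2.0 * math.pi / n
    wr = [math.cos(step * k) for k in range(n)]
    wi = [math.sin(step * k) for k in range(n)]
    return wr, wi

def rotate(re, im, c, s):
    """Multiply (re + j*im) by (c - j*s), as the general butterfly does."""
    return c * re + s * im, c * im - s * re

wr, wi = twiddle_tables(16)
# Entry k = 2 is the angle pi/4, the case the listing handles with C21.
print(rotate(1.0, 0.0, wr[2], wi[2]))
```

A table of N entries is more than the listing indexes, since IA3 never exceeds 3N/4; the full length is used here only to keep the sketch simple.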


18.10 Basic DIF Split Radix FFT Algorithm 


Below is the Fortran code for a simple Decimation-in-Frequency, Split-Radix, one butterfly 
FFT to be followed by a bit-reversing unscrambler. 


C
C   A DUHAMEL-HOLLMANN SPLIT RADIX FFT PROGRAM
C   FROM: ELECTRONICS LETTERS, JAN. 5, 1984
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   LENGTH IS N = 2 ** M
C      C. S. BURRUS, RICE UNIVERSITY, MARCH 1984
C
      SUBROUTINE FFT (X,Y,N,M)
      REAL X(1), Y(1)
C
      N1 = N
      N2 = N/2
      IP = 0
      IS = 1
      A  = 6.283185307179586/N
      DO 10 K = 1, M-1
         JD = N1 + N2
         N1 = N2
         N2 = N2/2
         J0 = N1*IP + 1
         IP = 1 - IP
         DO 20 J = J0, N, JD
            JS = 0
            JT = J + N2 - 1
            DO 30 I = J, JT
               JSS= JS*IS
               JS = JS + 1
               C1 = COS(A*JSS)
               C3 = COS(3*A*JSS)
               S1 = -SIN(A*JSS)
               S3 = -SIN(3*A*JSS)
               I1 = I  + N2
               I2 = I1 + N2
               I3 = I2 + N2
               R1    = X(I ) + X(I2)
               R2    = X(I ) - X(I2)
               R3    = X(I1) - X(I3)
               X(I2) = X(I1) + X(I3)
               X(I1) = R1
C
               R1    = Y(I ) + Y(I2)
               R4    = Y(I ) - Y(I2)
               R5    = Y(I1) - Y(I3)
               Y(I2) = Y(I1) + Y(I3)
               Y(I1) = R1
C
               R1    = R2 - R5
               R2    = R2 + R5
               R5    = R4 + R3
               R4    = R4 - R3
C
               X(I)  = C1*R1 + S1*R5
               Y(I)  = C1*R5 - S1*R1
               X(I3) = C3*R2 + S3*R4
               Y(I3) = C3*R4 - S3*R2
  30        CONTINUE
  20     CONTINUE
         IS = IS + IS
  10  CONTINUE
      IP = 1 - IP
      J0 = 2 - IP
      DO 5 I = J0, N-1, 3
         I1 = I + 1
         R1    = X(I) + X(I1)
         X(I1) = X(I) - X(I1)
         X(I)  = R1
         R1    = Y(I) + Y(I1)
         Y(I1) = Y(I) - Y(I1)
         Y(I)  = R1
   5  CONTINUE
      RETURN
      END


18.11 DIF Split Radix FFT Algorithm 


Below is the Fortran code for a simple Decimation-in-Frequency, Split-Radix, two butterfly 
FFT to be followed by a bit-reversing unscrambler. Twiddle factors are precalculated and 
stored in arrays WR and WI. 


C-------------------------------------------------------------C
C     A DUHAMEL-HOLLMAN SPLIT RADIX FFT                       C
C     REF: ELECTRONICS LETTERS, JAN. 5, 1984                  C
C     COMPLEX INPUT AND OUTPUT DATA IN ARRAYS X AND Y         C
C     LENGTH IS N = 2 ** M, OUTPUT IN BIT-REVERSED ORDER      C
C     TWO BUTTERFLIES TO REMOVE MULTS BY UNITY                C
C     SPECIAL LAST TWO STAGES                                 C
C     TABLE LOOK-UP OF SINE AND COSINE VALUES                 C
C     C.S. BURRUS, RICE UNIV. APRIL 1985                      C
C-------------------------------------------------------------C
C
      SUBROUTINE FFT(X,Y,N,M,WR,WI)
      REAL X(1),Y(1),WR(1),WI(1)
      C81= 0.707106778
C
      N2 = 2*N
      DO 10 K = 1, M-3
         IS = 1
         ID = N2
         N2 = N2/2
         N4 = N2/4
   2     DO 1 I0 = IS, N-1, ID
            I1 = I0 + N4
            I2 = I1 + N4
            I3 = I2 + N4
            R1    = X(I0) - X(I2)
            X(I0) = X(I0) + X(I2)
            R2    = Y(I1) - Y(I3)
            Y(I1) = Y(I1) + Y(I3)
            X(I2) = R1 + R2
            R2    = R1 - R2
            R1    = X(I1) - X(I3)
            X(I1) = X(I1) + X(I3)
            X(I3) = R2
            R2    = Y(I0) - Y(I2)
            Y(I0) = Y(I0) + Y(I2)
            Y(I2) =-R1 + R2
            Y(I3) = R1 + R2
   1     CONTINUE
         IS = 2*ID - N2 + 1
         ID = 4*ID
         IF (IS.LT.N) GOTO 2
         IE  = N/N2
         IA1 = 1
         DO 20 J = 2, N4
            IA1 = IA1 + IE
            IA3 = 3*IA1 - 2
            CC1 = WR(IA1)
            SS1 = WI(IA1)
            CC3 = WR(IA3)
            SS3 = WI(IA3)
            IS  = J
            ID  = 2*N2
  40        DO 30 I0 = IS, N-1, ID
               I1 = I0 + N4
               I2 = I1 + N4
               I3 = I2 + N4
               R1    = X(I0) - X(I2)
               X(I0) = X(I0) + X(I2)
               R2    = X(I1) - X(I3)
               X(I1) = X(I1) + X(I3)
               S1    = Y(I0) - Y(I2)
               Y(I0) = Y(I0) + Y(I2)
               S2    = Y(I1) - Y(I3)
               Y(I1) = Y(I1) + Y(I3)
               S3    = R1 - S2
               R1    = R1 + S2
               S2    = R2 - S1
               R2    = R2 + S1
               X(I2) = R1*CC1 - S2*SS1
               Y(I2) =-S2*CC1 - R1*SS1
               X(I3) = S3*CC3 + R2*SS3
               Y(I3) = R2*CC3 - S3*SS3
  30        CONTINUE
            IS = 2*ID - N2 + J
            ID = 4*ID
            IF (IS.LT.N) GOTO 40
  20     CONTINUE
  10  CONTINUE
C
      IS = 1
      ID = 32
  50  DO 60 I = IS, N, ID
         I0 = I + 8
         DO 15 J = 1, 2
            R1 = X(I0)   + X(I0+2)
            R3 = X(I0)   - X(I0+2)
            R2 = X(I0+1) + X(I0+3)
            R4 = X(I0+1) - X(I0+3)
            X(I0)   = R1 + R2
            X(I0+1) = R1 - R2
            R1 = Y(I0)   + Y(I0+2)
            S3 = Y(I0)   - Y(I0+2)
            R2 = Y(I0+1) + Y(I0+3)
            S4 = Y(I0+1) - Y(I0+3)
            Y(I0)   = R1 + R2
            Y(I0+1) = R1 - R2
            Y(I0+2) = S3 - R4
            Y(I0+3) = S3 + R4
            X(I0+2) = R3 + S4
            X(I0+3) = R3 - S4
            I0 = I0 + 4
  15     CONTINUE
  60  CONTINUE
      IS = 2*ID - 15
      ID = 4*ID
      IF (IS.LT.N) GOTO 50
C
      IS = 1
      ID = 16
  55  DO 65 I0 = IS, N, ID
         R1 = X(I0)   + X(I0+4)
         R5 = X(I0)   - X(I0+4)
         R2 = X(I0+1) + X(I0+5)
         R6 = X(I0+1) - X(I0+5)
         R3 = X(I0+2) + X(I0+6)
         R7 = X(I0+2) - X(I0+6)
         R4 = X(I0+3) + X(I0+7)
         R8 = X(I0+3) - X(I0+7)
         T1 = R1 - R3
         R1 = R1 + R3
         R3 = R2 - R4
         R2 = R2 + R4
         X(I0)   = R1 + R2
         X(I0+1) = R1 - R2
C
         R1 = Y(I0)   + Y(I0+4)
         S5 = Y(I0)   - Y(I0+4)
         R2 = Y(I0+1) + Y(I0+5)
         S6 = Y(I0+1) - Y(I0+5)
         S3 = Y(I0+2) + Y(I0+6)
         S7 = Y(I0+2) - Y(I0+6)
         R4 = Y(I0+3) + Y(I0+7)
         S8 = Y(I0+3) - Y(I0+7)
         T2 = R1 - S3
         R1 = R1 + S3
         S3 = R2 - R4
         R2 = R2 + R4
         Y(I0)   = R1 + R2
         Y(I0+1) = R1 - R2
         X(I0+2) = T1 + S3
         X(I0+3) = T1 - S3
         Y(I0+2) = T2 - R3
         Y(I0+3) = T2 + R3
C
         R1 = (R6 - R8)*C81
         R6 = (R6 + R8)*C81
         R2 = (S6 - S8)*C81
         S6 = (S6 + S8)*C81
C
         T1 = R5 - R1
         R5 = R5 + R1
         R8 = R7 - R6
         R7 = R7 + R6
         T2 = S5 - R2
         S5 = S5 + R2
         S8 = S7 - S6
         S7 = S7 + S6
         X(I0+4) = R5 + S7
         X(I0+7) = R5 - S7
         X(I0+5) = T1 + S8
         X(I0+6) = T1 - S8
         Y(I0+4) = S5 - R7
         Y(I0+7) = S5 + R7
         Y(I0+5) = T2 - R8
         Y(I0+6) = T2 + R8
  65  CONTINUE
      IS = 2*ID - 7
      ID = 4*ID
      IF (IS.LT.N) GOTO 55
C
C---------------BIT REVERSE COUNTER---------------------------
C
      J = 1
      N1 = N - 1
      DO 104 I=1, N1
         IF (I.GE.J) GOTO 101
         XT   = X(J)
         X(J) = X(I)
         X(I) = XT
         XT   = Y(J)
         Y(J) = Y(I)
         Y(I) = XT
 101     K = N/2
 102     IF (K.GE.J) GOTO 103
         J = J - K
         K = K/2
         GOTO 102
 103     J = J + K
 104  CONTINUE
      RETURN
      END
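
The counter at the end of the listing, and its base-4 variant in the radix-4 program, can be checked in isolation. The Python sketch below transcribes that counter for a general radix and applies the same "swap only when I < J" rule. The function names are hypothetical, the generalization to an arbitrary radix is an assumption that reduces to the two listed cases for radix 2 and 4, and Python is used only for illustration.

```python
def digit_reverse_counter(n, radix):
    """Return perm with perm[i] = digit-reversed i (0-based), for n a power of radix.

    J is kept 1-based, as in the Fortran; each pass adds one to the
    digit-reversed number, propagating the carry from the top digit down.
    """
    perm = [0] * n
    j = 1
    for i in range(1, n):                   # I = 1 .. N-1, as in the listing
        perm[i - 1] = j - 1
        k = n // radix
        while k * (radix - 1) < j:          # top digit is full: clear it, carry on
            j -= k * (radix - 1)
            k //= radix
        j += k
    perm[n - 1] = j - 1
    return perm

def unscramble(x, radix):
    """Reorder x in place by swapping each pair once (only when i < j)."""
    for i, j in enumerate(digit_reverse_counter(len(x), radix)):
        if i < j:
            x[i], x[j] = x[j], x[i]
    return x

print(unscramble(list(range(8)), 2))
print(unscramble(list(range(16)), 4))
```

Because digit reversal is its own inverse, swapping each pair once is enough, which is why the listings need only one temporary per array.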


18.12 Prime Factor FFT Algorithm 


Below is the Fortran code for a Prime-Factor Algorithm (PFA) FFT allowing factors of the 
length of 2, 3, 4, 5, and 7. It is followed by an unscrambler. 


C
C   A PRIME FACTOR FFT PROGRAM WITH GENERAL MODULES
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   COMPLEX OUTPUT IN A AND B
C   LENGTH N WITH M FACTORS IN ARRAY NI
C     N = NI(1)*NI(2)* ... *NI(M)
C   UNSCRAMBLING CONSTANT UNSC
C     UNSC = N/NI(1) + N/NI(2) +...+ N/NI(M), MOD N
C      C. S. BURRUS, RICE UNIVERSITY, JAN 1987
C
      SUBROUTINE PFA(X,Y,N,M,NI,A,B,UNSC)
C
      INTEGER NI(4), I(16), UNSC
      REAL X(1), Y(1), A(1), B(1)
C
      DATA C31, C32 / -0.86602540,-1.50000000 /
      DATA C51, C52 /  0.95105652,-1.53884180 /
      DATA C53, C54 / -0.36327126, 0.55901699 /
      DATA C55      / -1.25 /
      DATA C71, C72 / -1.16666667,-0.79015647 /
      DATA C73, C74 /  0.055854267, 0.7343022 /
      DATA C75, C76 /  0.44095855,-0.34087293 /
      DATA C77, C78 /  0.53396936, 0.87484229 /
C
C-----------------NESTED LOOPS----------------------
C
      DO 10 K=1, M
         N1 = NI(K)
         N2 = N/N1
         DO 15 J=1, N, N1
            IT = J
            DO 30 L=1, N1
               I(L) = IT
               A(L) = X(IT)
               B(L) = Y(IT)
               IT = IT + N2
               IF (IT.GT.N) IT = IT - N
  30        CONTINUE
            GOTO (20,102,103,104,105,20,107), N1
C
C-----------------WFTA N=2--------------------------
C
 102        R1   = A(1)
            A(1) = R1 + A(2)
            A(2) = R1 - A(2)
C
            R1   = B(1)
            B(1) = R1 + B(2)
            B(2) = R1 - B(2)
C
            GOTO 20
C-----------------WFTA N=3--------------------------
C
 103        R2 = (A(2) - A(3)) * C31
            R1 =  A(2) + A(3)
            A(1)= A(1) + R1
            R1 =  A(1) + R1 * C32
C
            S2 = (B(2) - B(3)) * C31
            S1 =  B(2) + B(3)
            B(1)= B(1) + S1
            S1 =  B(1) + S1 * C32
C
            A(2) = R1 - S2
            A(3) = R1 + S2
            B(2) = S1 + R2
            B(3) = S1 - R2
C
            GOTO 20
C
C-----------------WFTA N=4--------------------------
C
 104        R1 = A(1) + A(3)
            T1 = A(1) - A(3)
            R2 = A(2) + A(4)
            A(1) = R1 + R2
            A(3) = R1 - R2
            R1 = B(1) + B(3)
            T2 = B(1) - B(3)
            R2 = B(2) + B(4)
            B(1) = R1 + R2
            B(3) = R1 - R2
C
            R1 = A(2) - A(4)
            R2 = B(2) - B(4)
C
            A(2) = T1 + R2
            A(4) = T1 - R2
            B(2) = T2 - R1
            B(4) = T2 + R1
C
            GOTO 20
C
C-----------------WFTA N=5--------------------------
C
 105        R1 = A(2) + A(5)
            R4 = A(2) - A(5)
            R3 = A(3) + A(4)
            R2 = A(3) - A(4)
C
            T  = (R1 - R3) * C54
            R1 = R1 + R3
            A(1) = A(1) + R1
            R1   = A(1) + R1 * C55
C
            R3 = R1 - T
            R1 = R1 + T
C
            T  = (R4 + R2) * C51
            R4 = T + R4 * C52
            R2 = T + R2 * C53
C
            S1 = B(2) + B(5)
            S4 = B(2) - B(5)
            S3 = B(3) + B(4)
            S2 = B(3) - B(4)
C
            T  = (S1 - S3) * C54
            S1 = S1 + S3
            B(1) = B(1) + S1
            S1   = B(1) + S1 * C55
C
            S3 = S1 - T
            S1 = S1 + T
C
            T  = (S4 + S2) * C51
            S4 = T + S4 * C52
            S2 = T + S2 * C53
C
            A(2) = R1 + S2
            A(5) = R1 - S2
            A(3) = R3 - S4
            A(4) = R3 + S4
C
            B(2) = S1 - R2
            B(5) = S1 + R2
            B(3) = S3 + R4
            B(4) = S3 - R4
C
            GOTO 20
C-----------------WFTA N=7--------------------------
C
 107        R1 = A(2) + A(7)
            R6 = A(2) - A(7)
            S1 = B(2) + B(7)
            S6 = B(2) - B(7)
            R2 = A(3) + A(6)
            R5 = A(3) - A(6)
            S2 = B(3) + B(6)
            S5 = B(3) - B(6)
            R3 = A(4) + A(5)
            R4 = A(4) - A(5)
            S3 = B(4) + B(5)
            S4 = B(4) - B(5)
C
            T3 = (R1 - R2) * C74
            T  = (R1 - R3) * C72
            R1 = R1 + R2 + R3
            A(1) = A(1) + R1
            R1   = A(1) + R1 * C71
            R2 =(R3 - R2) * C73
            R3 = R1 - T + R2
            R2 = R1 - R2 - T3
            R1 = R1 + T + T3
            T  = (R6 - R5) * C78
            T3 = (R6 + R4) * C76
            R6 = (R6 + R5 - R4) * C75
            R5 = (R5 + R4) * C77
            R4 = R6 - T3 + R5
            R5 = R6 - R5 - T
            R6 = R6 + T3 + T
C
            T3 = (S1 - S2) * C74
            T  = (S1 - S3) * C72
            S1 = S1 + S2 + S3
            B(1) = B(1) + S1
            S1   = B(1) + S1 * C71
            S2 =(S3 - S2) * C73
            S3 = S1 - T + S2
            S2 = S1 - S2 - T3
            S1 = S1 + T + T3
            T  = (S6 - S5) * C78
            T3 = (S6 + S4) * C76
            S6 = (S6 + S5 - S4) * C75
            S5 = (S5 + S4) * C77
            S4 = S6 - T3 + S5
            S5 = S6 - S5 - T
            S6 = S6 + T3 + T
C
            A(2) = R3 + S4
            A(7) = R3 - S4
            A(3) = R1 + S6
            A(6) = R1 - S6
            A(4) = R2 - S5
            A(5) = R2 + S5
            B(4) = S2 + R5
            B(5) = S2 - R5
            B(2) = S3 - R4
            B(7) = S3 + R4
            B(3) = S1 - R6
            B(6) = S1 + R6
C
  20        IT = J
            DO 31 L=1, N1
               I(L) = IT
               X(IT) = A(L)
               Y(IT) = B(L)
               IT = IT + N2
               IF (IT.GT.N) IT = IT - N
  31        CONTINUE
  15     CONTINUE
  10  CONTINUE
C
C-----------------UNSCRAMBLING----------------------
C
      L = 1
      DO 2 K=1, N
         A(K) = X(L)
         B(K) = Y(L)
         L = L + UNSC
         IF (L.GT.N) L = L - N
   2  CONTINUE
      RETURN
      END
C 
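
The index stepping and the unscrambling constant in the program above can be tried on a small case. The Python sketch below follows the same structure (step by N2 modulo N inside each short transform, then read the result with stride UNSC), but replaces the WFTA modules with a direct DFT, so it illustrates the indexing only and not the arithmetic savings. The names `dft` and `pfa` are hypothetical, indices are 0-based, and Python is used only for illustration.

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT; stands in for the short WFTA modules."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def pfa(x, factors):
    """Prime factor FFT sketch; the entries of factors must be pairwise coprime."""
    n = len(x)
    data = list(x)
    for n1 in factors:
        n2 = n // n1
        for j in range(0, n, n1):
            # Same stepping as the listing: IT = IT + N2, reduced modulo N.
            idx = [(j + l * n2) % n for l in range(n1)]
            short = dft([data[i] for i in idx])
            for i, value in zip(idx, short):
                data[i] = value
    unsc = sum(n // f for f in factors) % n
    # X(k) ends up at address k*UNSC mod N, so read with stride UNSC.
    return [data[(k * unsc) % n] for k in range(n)]

x = [complex(m, -m * m) for m in range(15)]
err = max(abs(a - b) for a, b in zip(pfa(x, [3, 5]), dft(x)))
print(err < 1e-9)
```

For N = 15 with factors 3 and 5 the constant is UNSC = 5 + 3 = 8, so X(1) is read from address 8, X(2) from address 1, and so on.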


18.13 In Place, In Order Prime Factor FFT Algorithm 


Below is the Fortran code for a Prime-Factor Algorithm (PFA) FFT allowing factors of 
the length of 2, 3, 4, 5, 7, 8, 9, and 16. It is both in-place and in-order, so requires no 
unscrambler. 


C
C   A PRIME FACTOR FFT PROGRAM
C   IN-PLACE AND IN-ORDER
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   LENGTH N WITH M FACTORS IN ARRAY NI
C     N = NI(1)*NI(2)*...*NI(M)
C   REDUCED TEMP STORAGE IN SHORT WFTA MODULES
C   Has modules 2,3,4,5,7,8,9,16
C   PROGRAM BY C. S. BURRUS, RICE UNIVERSITY
C      SEPT 1983
C----------------------------------------------------
C
      SUBROUTINE PFA(X,Y,N,M,NI)
      INTEGER NI(4), I(16), IP(16), LP(16)
      REAL X(1), Y(1)
      DATA C31, C32   / -0.86602540,-1.50000000 /
      DATA C51, C52   /  0.95105652,-1.53884180 /
      DATA C53, C54   / -0.36327126, 0.55901699 /
      DATA C55        / -1.25 /
      DATA C71, C72   / -1.16666667,-0.79015647 /
      DATA C73, C74   /  0.055854267, 0.7343022 /
      DATA C75, C76   /  0.44095855,-0.34087293 /
      DATA C77, C78   /  0.53396936, 0.87484229 /
      DATA C81        /  0.70710678 /
      DATA C95        / -0.50000000 /
      DATA C92, C93   /  0.93969262, -0.17364818 /
      DATA C94, C96   /  0.76604444, -0.34202014 /
      DATA C97, C98   / -0.98480775, -0.64278761 /
      DATA C162,C163  /  0.38268343, 1.30656297 /
      DATA C164,C165  /  0.54119610, 0.92387953 /
C
C-----------------NESTED LOOPS----------------------
C
      DO 10 K=1, M
         N1 = NI(K)
         N2 = N/N1
         L  = 1
         N3 = N2 - N1*(N2/N1)
         DO 15 J = 1, N1
            LP(J) = L
            L = L + N3
            IF (L.GT.N1) L = L - N1
  15     CONTINUE
C
         DO 20 J=1, N, N1
            IT = J
            DO 30 L=1, N1
               I(L) = IT
               IP(LP(L)) = IT
               IT = IT + N2
               IF (IT.GT.N) IT = IT - N
  30        CONTINUE
            GOTO (20,102,103,104,105,20,107,108,109,
     +            20,20,20,20,20,20,116),N1
C-----------------WFTA N=2--------------------------
C
 102        R1      = X(I(1))
            X(I(1)) = R1 + X(I(2))
            X(I(2)) = R1 - X(I(2))
C
            R1       = Y(I(1))
            Y(IP(1)) = R1 + Y(I(2))
            Y(IP(2)) = R1 - Y(I(2))
C
            GOTO 20
C
C-----------------WFTA N=3--------------------------
C
 103        R2 = (X(I(2)) - X(I(3))) * C31
            R1 =  X(I(2)) + X(I(3))
            X(I(1))= X(I(1)) + R1
            R1 =  X(I(1)) + R1 * C32
C
            S2 = (Y(I(2)) - Y(I(3))) * C31
            S1 =  Y(I(2)) + Y(I(3))
            Y(I(1))= Y(I(1)) + S1
            S1 =  Y(I(1)) + S1 * C32
C
            X(IP(2)) = R1 - S2
            X(IP(3)) = R1 + S2
            Y(IP(2)) = S1 + R2
            Y(IP(3)) = S1 - R2
C
            GOTO 20
C
C-----------------WFTA N=4--------------------------
C
 104        R1 = X(I(1)) + X(I(3))
            T1 = X(I(1)) - X(I(3))
            R2 = X(I(2)) + X(I(4))
            X(IP(1)) = R1 + R2
            X(IP(3)) = R1 - R2
C
            R1 = Y(I(1)) + Y(I(3))
            T2 = Y(I(1)) - Y(I(3))
            R2 = Y(I(2)) + Y(I(4))
            Y(IP(1)) = R1 + R2
            Y(IP(3)) = R1 - R2
C
            R1 = X(I(2)) - X(I(4))
            R2 = Y(I(2)) - Y(I(4))
C
            X(IP(2)) = T1 + R2
            X(IP(4)) = T1 - R2
            Y(IP(2)) = T2 - R1
            Y(IP(4)) = T2 + R1
C
            GOTO 20
C-----------------WFTA N=5--------------------------
C
 105        R1 = X(I(2)) + X(I(5))
            R4 = X(I(2)) - X(I(5))
            R3 = X(I(3)) + X(I(4))
            R2 = X(I(3)) - X(I(4))
C
            T  = (R1 - R3) * C54
            R1 = R1 + R3
            X(I(1)) = X(I(1)) + R1
            R1      = X(I(1)) + R1 * C55
C
            R3 = R1 - T
            R1 = R1 + T
C
            T  = (R4 + R2) * C51
            R4 = T + R4 * C52
            R2 = T + R2 * C53
C
            S1 = Y(I(2)) + Y(I(5))
            S4 = Y(I(2)) - Y(I(5))
            S3 = Y(I(3)) + Y(I(4))
            S2 = Y(I(3)) - Y(I(4))
C
            T  = (S1 - S3) * C54
            S1 = S1 + S3
            Y(I(1)) = Y(I(1)) + S1
            S1      = Y(I(1)) + S1 * C55
C
            S3 = S1 - T
            S1 = S1 + T
C
            T  = (S4 + S2) * C51
            S4 = T + S4 * C52
            S2 = T + S2 * C53
C
            X(IP(2)) = R1 + S2
            X(IP(5)) = R1 - S2
            X(IP(3)) = R3 - S4
            X(IP(4)) = R3 + S4
C
            Y(IP(2)) = S1 - R2
            Y(IP(5)) = S1 + R2
            Y(IP(3)) = S3 + R4
            Y(IP(4)) = S3 - R4
C
            GOTO 20
C-----------------WFTA N=7--------------------------
C
 107        R1 = X(I(2)) + X(I(7))
            R6 = X(I(2)) - X(I(7))
            S1 = Y(I(2)) + Y(I(7))
            S6 = Y(I(2)) - Y(I(7))
            R2 = X(I(3)) + X(I(6))
            R5 = X(I(3)) - X(I(6))
            S2 = Y(I(3)) + Y(I(6))
            S5 = Y(I(3)) - Y(I(6))
            R3 = X(I(4)) + X(I(5))
            R4 = X(I(4)) - X(I(5))
            S3 = Y(I(4)) + Y(I(5))
            S4 = Y(I(4)) - Y(I(5))
C
            T3 = (R1 - R2) * C74
            T  = (R1 - R3) * C72
            R1 = R1 + R2 + R3
            X(I(1)) = X(I(1)) + R1
            R1      = X(I(1)) + R1 * C71
            R2 =(R3 - R2) * C73
            R3 = R1 - T + R2
            R2 = R1 - R2 - T3
            R1 = R1 + T + T3
            T  = (R6 - R5) * C78
            T3 = (R6 + R4) * C76
            R6 = (R6 + R5 - R4) * C75
            R5 = (R5 + R4) * C77
            R4 = R6 - T3 + R5
            R5 = R6 - R5 - T
            R6 = R6 + T3 + T
C
            T3 = (S1 - S2) * C74
            T  = (S1 - S3) * C72
            S1 = S1 + S2 + S3
            Y(I(1)) = Y(I(1)) + S1
            S1      = Y(I(1)) + S1 * C71
            S2 =(S3 - S2) * C73
            S3 = S1 - T + S2
            S2 = S1 - S2 - T3
            S1 = S1 + T + T3
            T  = (S6 - S5) * C78
            T3 = (S6 + S4) * C76
            S6 = (S6 + S5 - S4) * C75
            S5 = (S5 + S4) * C77
            S4 = S6 - T3 + S5
            S5 = S6 - S5 - T
            S6 = S6 + T3 + T
C
            X(IP(2)) = R3 + S4
            X(IP(7)) = R3 - S4
            X(IP(3)) = R1 + S6
            X(IP(6)) = R1 - S6
            X(IP(4)) = R2 - S5
            X(IP(5)) = R2 + S5
            Y(IP(4)) = S2 + R5
            Y(IP(5)) = S2 - R5
            Y(IP(2)) = S3 - R4
            Y(IP(7)) = S3 + R4
            Y(IP(3)) = S1 - R6
            Y(IP(6)) = S1 + R6
C
            GOTO 20
C-----------------WFTA N=8--------------------------
C
 108        R1 = X(I(1)) + X(I(5))
            R2 = X(I(1)) - X(I(5))
            R3 = X(I(2)) + X(I(8))
            R4 = X(I(2)) - X(I(8))
            R5 = X(I(3)) + X(I(7))
            R6 = X(I(3)) - X(I(7))
            R7 = X(I(4)) + X(I(6))
            R8 = X(I(4)) - X(I(6))
            T1 = R1 + R5
            T2 = R1 - R5
            T3 = R3 + R7
            R3 =(R3 - R7) * C81
            X(IP(1)) = T1 + T3
            X(IP(5)) = T1 - T3
            T1 = R2 + R3
            T3 = R2 - R3
            S1 = R4 - R8
            R4 =(R4 + R8) * C81
            S2 = R4 + R6
            S3 = R4 - R6
            R1 = Y(I(1)) + Y(I(5))
            R2 = Y(I(1)) - Y(I(5))
            R3 = Y(I(2)) + Y(I(8))
            R4 = Y(I(2)) - Y(I(8))
            R5 = Y(I(3)) + Y(I(7))
            R6 = Y(I(3)) - Y(I(7))
            R7 = Y(I(4)) + Y(I(6))
            R8 = Y(I(4)) - Y(I(6))
            T4 = R1 + R5
            R1 = R1 - R5
            R5 = R3 + R7
            R3 =(R3 - R7) * C81
            Y(IP(1)) = T4 + R5
            Y(IP(5)) = T4 - R5
            R5 = R2 + R3
            R2 = R2 - R3
            R3 = R4 - R8
            R4 =(R4 + R8) * C81
            R7 = R4 + R6
            R4 = R4 - R6
            X(IP(2)) = T1 + R7
            X(IP(8)) = T1 - R7
            X(IP(3)) = T2 + R3
            X(IP(7)) = T2 - R3
            X(IP(4)) = T3 + R4
            X(IP(6)) = T3 - R4
            Y(IP(2)) = R5 - S2
            Y(IP(8)) = R5 + S2
            Y(IP(3)) = R1 - S1
            Y(IP(7)) = R1 + S1
            Y(IP(4)) = R2 - S3
            Y(IP(6)) = R2 + S3
C
            GOTO 20
C-----------------WFTA N=9--------------------------
C
 109        R1 = X(I(2)) + X(I(9))
            R2 = X(I(2)) - X(I(9))
            R3 = X(I(3)) + X(I(8))
            R4 = X(I(3)) - X(I(8))
            R5 = X(I(4)) + X(I(7))
            T8 =(X(I(4)) - X(I(7))) * C31
            R7 = X(I(5)) + X(I(6))
            R8 = X(I(5)) - X(I(6))
            T0 = X(I(1)) + R5
            T7 = X(I(1)) + R5 * C95
            R5 = R1 + R3 + R7
            X(I(1)) = T0 + R5
            T5 = T0 + R5 * C95
            T3 = (R3 - R7) * C92
            R7 = (R1 - R7) * C93
            R3 = (R1 - R3) * C94
            T1 = T7 + T3 + R3
            T3 = T7 - T3 - R7
            T7 = T7 + R7 - R3
            T6 = (R2 - R4 + R8) * C31
            T4 = (R4 + R8) * C96
            R8 = (R2 - R8) * C97
            R2 = (R2 + R4) * C98
            T2 = T8 + T4 + R2
            T4 = T8 - T4 - R8
            T8 = T8 + R8 - R2
C
            R1 = Y(I(2)) + Y(I(9))
            R2 = Y(I(2)) - Y(I(9))
            R3 = Y(I(3)) + Y(I(8))
            R4 = Y(I(3)) - Y(I(8))
            R5 = Y(I(4)) + Y(I(7))
            R6 =(Y(I(4)) - Y(I(7))) * C31
            R7 = Y(I(5)) + Y(I(6))
            R8 = Y(I(5)) - Y(I(6))
            T0 = Y(I(1)) + R5
            T9 = Y(I(1)) + R5 * C95
            R5 = R1 + R3 + R7
            Y(I(1)) = T0 + R5
            R5 = T0 + R5 * C95
            T0 = (R3 - R7) * C92
            R7 = (R1 - R7) * C93
            R3 = (R1 - R3) * C94
            R1 = T9 + T0 + R3
            T0 = T9 - T0 - R7
            R7 = T9 + R7 - R3
            R9 = (R2 - R4 + R8) * C31
            R3 = (R4 + R8) * C96
            R8 = (R2 - R8) * C97
            R4 = (R2 + R4) * C98
            R2 = R6 + R3 + R4
            R3 = R6 - R8 - R3
            R8 = R6 + R8 - R4
C
            X(IP(2)) = T1 - R2
            X(IP(9)) = T1 + R2
            Y(IP(2)) = R1 + T2
            Y(IP(9)) = R1 - T2
            X(IP(3)) = T3 + R3
            X(IP(8)) = T3 - R3
            Y(IP(3)) = T0 - T4
            Y(IP(8)) = T0 + T4
            X(IP(4)) = T5 - R9
            X(IP(7)) = T5 + R9
            Y(IP(4)) = R5 + T6
            Y(IP(7)) = R5 - T6
            X(IP(5)) = T7 - R8
            X(IP(6)) = T7 + R8
            Y(IP(5)) = R7 + T8
            Y(IP(6)) = R7 - T8
C
            GOTO 20
C-----------------WFTA N=16-------------------------
C
 116        R1  = X(I(1)) + X(I(9))
            R2  = X(I(1)) - X(I(9))
            R3  = X(I(2)) + X(I(10))
            R4  = X(I(2)) - X(I(10))
            R5  = X(I(3)) + X(I(11))
            R6  = X(I(3)) - X(I(11))
            R7  = X(I(4)) + X(I(12))
            R8  = X(I(4)) - X(I(12))
            R9  = X(I(5)) + X(I(13))
            R10 = X(I(5)) - X(I(13))
            R11 = X(I(6)) + X(I(14))
            R12 = X(I(6)) - X(I(14))
            R13 = X(I(7)) + X(I(15))
            R14 = X(I(7)) - X(I(15))
            R15 = X(I(8)) + X(I(16))
            R16 = X(I(8)) - X(I(16))
            T1 = R1 + R9
            T2 = R1 - R9
            T3 = R3 + R11
            T4 = R3 - R11
            T5 = R5 + R13
            T6 = R5 - R13
            T7 = R7 + R15
            T8 = R7 - R15
            R1 = T1 + T5
            R3 = T1 - T5
            R5 = T3 + T7
            R7 = T3 - T7
            X(IP( 1)) = R1 + R5
            X(IP( 9)) = R1 - R5
            T1  = C81 * (T4 + T8)
            T5  = C81 * (T4 - T8)
            R9  = T2 + T5
            R11 = T2 - T5
            R13 = T6 + T1
            R15 = T6 - T1
            T1  = R4 + R16
            T2  = R4 - R16
            T3  = C81 * (R6 + R14)
            T4  = C81 * (R6 - R14)
            T5  = R8 + R12
            T6  = R8 - R12
            T7  = C162 * (T2 - T6)
            T2  = C163 * T2 - T7
            T6  = C164 * T6 - T7
            T7  = R2 + T4
            T8  = R2 - T4
            R2  = T7 + T2
            R4  = T7 - T2
            R6  = T8 + T6
            R8  = T8 - T6
            T7  = C165 * (T1 + T5)
            T2  = T7 - C164 * T1
            T4  = T7 - C163 * T5
            T6  = R10 + T3
            T8  = R10 - T3
            R10 = T6 + T2
            R12 = T6 - T2
            R14 = T8 + T4
            R16 = T8 - T4
            R1  = Y(I(1)) + Y(I(9))
            S2  = Y(I(1)) - Y(I(9))
            S3  = Y(I(2)) + Y(I(10))
            S4  = Y(I(2)) - Y(I(10))
            R5  = Y(I(3)) + Y(I(11))
            S6  = Y(I(3)) - Y(I(11))
            S7  = Y(I(4)) + Y(I(12))
            S8  = Y(I(4)) - Y(I(12))
            S9  = Y(I(5)) + Y(I(13))
            S10 = Y(I(5)) - Y(I(13))
            S11 = Y(I(6)) + Y(I(14))
            S12 = Y(I(6)) - Y(I(14))
            S13 = Y(I(7)) + Y(I(15))
            S14 = Y(I(7)) - Y(I(15))
            S15 = Y(I(8)) + Y(I(16))
            S16 = Y(I(8)) - Y(I(16))
            T1 = R1 + S9
            T2 = R1 - S9
            T3 = S3 + S11
            T4 = S3 - S11
            T5 = R5 + S13
            T6 = R5 - S13
            T7 = S7 + S15
            T8 = S7 - S15
            R1 = T1 + T5
            S3 = T1 - T5
            R5 = T3 + T7
            S7 = T3 - T7
            Y(IP( 1)) = R1 + R5
            Y(IP( 9)) = R1 - R5
            X(IP( 5)) = R3 + S7
            X(IP(13)) = R3 - S7
            Y(IP( 5)) = S3 - R7
            Y(IP(13)) = S3 + R7
            T1  = C81 * (T4 + T8)
            T5  = C81 * (T4 - T8)
            S9  = T2 + T5
            S11 = T2 - T5
            S13 = T6 + T1
            S15 = T6 - T1
            T1  = S4 + S16
            T2  = S4 - S16
            T3  = C81 * (S6 + S14)
            T4  = C81 * (S6 - S14)
            T5  = S8 + S12
            T6  = S8 - S12
            T7  = C162 * (T2 - T6)
            T2  = C163 * T2 - T7
            T6  = C164 * T6 - T7
            T7  = S2 + T4
            T8  = S2 - T4
            S2  = T7 + T2
            S4  = T7 - T2
            S6  = T8 + T6
            S8  = T8 - T6
            T7  = C165 * (T1 + T5)
            T2  = T7 - C164 * T1
            T4  = T7 - C163 * T5
            T6  = S10 + T3
            T8  = S10 - T3
            S10 = T6 + T2
            S12 = T6 - T2
            S14 = T8 + T4
            S16 = T8 - T4
            X(IP( 2)) = R2 + S10
            X(IP(16)) = R2 - S10
            Y(IP( 2)) = S2 - R10
            Y(IP(16)) = S2 + R10
            X(IP( 3)) = R9 + S13
            X(IP(15)) = R9 - S13
            Y(IP( 3)) = S9 - R13
            Y(IP(15)) = S9 + R13
            X(IP( 4)) = R8 - S16
            X(IP(14)) = R8 + S16
            Y(IP( 4)) = S8 + R16
            Y(IP(14)) = S8 - R16
            X(IP( 6)) = R6 + S14
            X(IP(12)) = R6 - S14
            Y(IP( 6)) = S6 - R14
            Y(IP(12)) = S6 + R14
            X(IP( 7)) = R11 - S15
            X(IP(11)) = R11 + S15
            Y(IP( 7)) = S11 + R15
            Y(IP(11)) = S11 - R15
            X(IP( 8)) = R4 - S12
            X(IP(10)) = R4 + S12
            Y(IP( 8)) = S4 + R12
            Y(IP(10)) = S4 - R12
            GOTO 20
C
  20     CONTINUE
  10  CONTINUE
      RETURN
      END
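
The role of the LP permutation can be seen on a small case. In the program, LP(J) steps through the output indices in increments of N3 = N2 mod N1, and each short module writes output LP(L) to the address that held input L, which is what leaves the final result in order. The Python sketch below keeps only that indexing and uses a direct DFT in place of the WFTA modules. The names `dft` and `pfa_in_order` are hypothetical, indices are 0-based, and Python is used only for illustration.

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT; stands in for the short WFTA modules."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def pfa_in_order(x, factors):
    """In-order prime factor FFT sketch; factors must be pairwise coprime."""
    n = len(x)
    data = list(x)
    for n1 in factors:
        n2 = n // n1
        n3 = n2 % n1                      # N3 = N2 - N1*(N2/N1) in the listing
        for j in range(0, n, n1):
            idx = [(j + l * n2) % n for l in range(n1)]
            short = dft([data[i] for i in idx])
            for l, i in enumerate(idx):
                # 0-based form of IP(LP(L)) = I(L): the address of input l
                # receives short-DFT output number (l * n3) mod n1.
                data[i] = short[(l * n3) % n1]
    return data

x = [complex(m, -m * m) for m in range(15)]
err = max(abs(a - b) for a, b in zip(pfa_in_order(x, [3, 5]), dft(x)))
print(err < 1e-9)
```

The sketch copies each short transform into a temporary list, whereas the Fortran modules order their reads and writes so that the permuted store can be done with reduced temporary storage.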
Appendix 4: Programs for Short FFTs¹ 


This appendix will discuss efficient short FFT programs that can be used in both the Cooley- 
Tukey (Chapter 9) and the Prime Factor FFT algorithms (Chapter 10). Links and references 
are given to Fortran listings that can be used "as is" or put into the indexed loops of existing 
programs to give greater efficiency and/or a greater variety of allowed lengths. Special 
programs have been written for lengths: N = 2, 3, 4, 5, 7, 8, 9, 11, 13, 16, 17, 19, 25, 
etc. 


In the early days of the FFT, multiplication was done in software and was, therefore, much 
slower than an addition. With modern hardware, a floating point multiplication can be done 
in one clock cycle of the computer, microprocessor, or DSP chip, requiring the same time as 
an addition. Indeed, in some computers and many DSP chips, both a multiplication and an 
addition (or accumulation) can be done in one cycle while the indexing and memory access 
is done in parallel. Most of the algorithms described here are not hardware architecture 
specific but are designed to minimize both multiplications and additions. 


The most basic and most often used FFT (or DFT) length is N = 2. In the Cooley-Tukey FFT, 
it is called a "butterfly", and it is notable for requiring no multiplications at all: only 
one complex addition and one complex subtraction, with only one complex temporary 
storage location. This is illustrated in Figure 1: The Prime Factor and Winograd Transform 
Algorithms (Figure 10.1) and code is shown in Figure 2: The Prime Factor and Winograd 
Transform Algorithms (Figure 10.2). The second most used length is N = 4 because it is the 
only other short length requiring no multiplications and a minimum of additions. All other 
short FFTs require some multiplication, but for powers of two, N = 8 and N = 16 require 
few enough to be worth special coding in some situations. 
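
The claim that lengths 2 and 4 need no multiplications can be made concrete with a short Python sketch. Multiplying by -j only swaps the real and imaginary parts and changes one sign, so the length-4 DFT is built from additions and subtractions alone. The function names below are hypothetical and Python is used only for illustration; this is not the Fortran code of the referenced figures.

```python
def dft2(x0, x1):
    """Length-2 DFT (butterfly): one complex addition, one complex subtraction."""
    return x0 + x1, x0 - x1

def times_minus_j(z):
    """Multiply by -j by swapping parts and negating one; no multiplication."""
    return complex(z.imag, -z.real)

def dft4(x0, x1, x2, x3):
    """Length-4 DFT from two stages of butterflies and one swap."""
    a, b = dft2(x0, x2)                # a = x0 + x2, b = x0 - x2
    c, d = dft2(x1, x3)                # c = x1 + x3, d = x1 - x3
    d = times_minus_j(d)
    return a + c, b + d, a - c, b - d

print(dft4(1 + 0j, 2 + 0j, 3 + 0j, 4 + 0j))
```

Counting operations in `dft4` gives eight complex additions or subtractions, that is, sixteen real additions, which matches the structure of the N = 4 modules in the prime factor programs.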


Code for other short lengths such as the primes N = 3, 5, 7, 11, 13, 17, and 19 and the 
composites N = 9 and 25 is included in the programs for the prime factor algorithm or 
the WFTA. They are derived using the theory in Chapters 5, 6, and 9. They can also be 
found in references ... and 


If these short FFTs are used as modules in the basic prime factor algorithm (PFA), then 
the straightforward development used for the modules in Figure 17.12 is used. However, if 
the more complicated indexing used to achieve in-order, in-place calculation in {xxxxx} is 
adopted, the modules require different code. 
For each of the indicated lengths, the computer code is given in a Connexions module. 


They are not in the collection Fast Fourier Transforms² as the printed version would be too 
long. However, one can link to them on-line from the following buttons: 


N=2³ 
N=3⁴ 
N=4⁵ 
N=5⁶ 
N=7⁷ 
N=8 
N=9 
N=11 
N=13 
N=16 
N=17 
N=19 
N=25 


Versions for the in-place, in-order prime factor algorithm {pfa} can be obtained from: 


N=2⁸ 
N=3⁹ 
N=4¹⁰ 
N=5¹¹ 
N=7¹² 
N=8¹³ 
N=9¹⁴ 
N=11¹⁵ 
N=13¹⁶ 
N=16¹⁷ 
N=17¹⁸ 
N=19¹⁹ 
N=25²⁰ 


A technical report that describes the length 11, 13, 17, and 19 modules is in {report 8105}, and 
another technical report that describes a program that will automatically generate a prime 
length FFT and its flow graph is in {report xxx}. 


¹This content is available online at <http://cnx.org/content/m17646/1.4/>. 
²Fast Fourier Transforms <http://cnx.org/content/col10550/latest/> 
³"N=2" <http://cnx.org/content/m17625/latest/> 
⁴"N=3" <http://cnx.org/content/m17626/latest/> 
⁵"N=4" <http://cnx.org/content/m17627/latest/> 
⁶"N=5" <http://cnx.org/content/m17628/latest/> 
⁷"N=7" <http://cnx.org/content/m17629/latest/> 
⁸"pN=2" <http://cnx.org/content/m17631/latest/> 
⁹"pN=3" <http://cnx.org/content/m17632/latest/> 
¹⁰"pN=4" <http://cnx.org/content/m17633/latest/> 
¹¹"pN=5" <http://cnx.org/content/m17634/latest/> 
¹²"pN=7" <http://cnx.org/content/m17635/latest/> 
¹³"pN=8" <http://cnx.org/content/m17636/latest/> 
¹⁴"pN=9" <http://cnx.org/content/m17637/latest/> 
¹⁵"N = 11 Winograd FFT module" <http://cnx.org/content/m17377/latest/> 
¹⁶"N = 13 Winograd FFT module" <http://cnx.org/content/m17378/latest/> 
¹⁷"N = 16 FFT module" <http://cnx.org/content/m17382/latest/> 
¹⁸"N = 17 Winograd FFT module" <http://cnx.org/content/m17380/latest/> 
¹⁹"N = 19 Winograd FFT module" <http://cnx.org/content/m17381/latest/> 
²⁰"N = 25 FFT module" <http://cnx.org/content/m17383/latest/> 
Fast Fourier Transforms 

This book uses an index map, a polynomial decomposition, an operator factorization, and a 
conversion to a filter to develop a very general and efficient description of fast algorithms to 
calculate the discrete Fourier transform (DFT). The work of Winograd is outlined, chapters 
by Selesnick, Pueschel, and Johnson are included, and computer programs are provided. 




